[ 
https://issues.apache.org/jira/browse/NUTCH-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154614#comment-17154614
 ] 

Sebastian Nagel commented on NUTCH-2798:
----------------------------------------

{quote}So what should I do if I need a higher size limit to crawl big HTML 
pages, e.g. 400 KB?
{quote}
 

You need to modify the configuration, ideally by adding the modified 
http.content.limit property to your nutch-site.xml, see 
[https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial#NutchTutorial-Customizeyourcrawlproperties]
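
As a minimal sketch of such an override: the property name http.content.limit is the one mentioned above, while the value of 524288 bytes (512 KiB) is only an illustrative assumption chosen to leave headroom over a 400 KB page. Check the default and the exact semantics in the nutch-default.xml shipped with your Nutch version before relying on it.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides of nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- Assumed example value: 512 KiB, in bytes. Content longer than
         this limit is truncated by the HTTP protocol plugin. -->
    <value>524288</value>
  </property>
</configuration>
```

Pages fetched before the change will likely need to be fetched again for the new limit to take effect.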

Please consider asking for help with using and configuring Nutch on the 
[Nutch user mailing list|https://nutch.apache.org/mailing_lists.html] or on 
Stack Overflow. Thanks!

> Nutch v2.4 Not Able to crawl after javax.faces.viewstate
> --------------------------------------------------------
>
>                 Key: NUTCH-2798
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2798
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 2.4
>         Environment: Ubuntu mate
>            Reporter: Mihir Sharma
>            Priority: Critical
>         Attachments: image-2020-07-06-20-20-49-580.png, 
> image-2020-07-06-20-22-07-351.png, image-2020-07-09-19-43-28-586.png
>
>
> Nutch v2.4 does not crawl the HTML page past the input tag named 
> javax.faces.viewstate. It crawls the content before this tag but cannot get 
> past the javax viewstate value, which contains a lot of special characters.
> The page has several tabs. The crawler currently fetches information up to 
> the date (Date Published: 06/30/2020 09:00 PM). After that it is unable to 
> fetch anything from the title *Assembly Bill No. 103* onward.
> I am crawling this site: 
> [http://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=201920200AB103]
>  
> !image-2020-07-06-20-20-49-580.png!
>  
> This is the output I am getting after crawling.
>  
> !image-2020-07-06-20-22-07-351.png!
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
