[ 
https://issues.apache.org/jira/browse/NUTCH-2511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659429#comment-16659429
 ] 

Sebastian Nagel commented on NUTCH-2511:
----------------------------------------

The easiest way would be to increase the limit by calling 
{{conf.setInt("http.content.limit", SiteMapParser.MAX_BYTES_ALLOWED);}} in the 
setup(context) method of SitemapMapper. Possibly make it configurable, because 
fetching 50 MB may take very long if it is done multiple times per sitemap index 
and for less responsive hosts. Also, the SiteMapParser now supports processing 
truncated sitemaps; this should be enabled. 
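
A minimal sketch of the configurable variant, not the actual patch. It uses 
java.util.Properties as a stand-in for the Hadoop Configuration so that it runs 
without Nutch on the classpath. The property name "sitemap.content.limit", the 
class name and the helper methods are hypothetical, and the 50 MB constant is 
assumed to mirror SiteMapParser.MAX_BYTES_ALLOWED.

```java
import java.util.Properties;

public class SitemapLimitSketch {

    // 50 MB, the maximum sitemap size named in the issue description.
    // Assumption: mirrors SiteMapParser.MAX_BYTES_ALLOWED.
    static final int MAX_SITEMAP_BYTES = 50 * 1024 * 1024;

    // Hypothetical property that lets users lower the sitemap limit.
    static final String SITEMAP_LIMIT_KEY = "sitemap.content.limit";

    // Read an integer property, falling back to a default if it is unset.
    static int getInt(Properties conf, String key, int defaultValue) {
        String value = conf.getProperty(key);
        return value == null ? defaultValue : Integer.parseInt(value.trim());
    }

    // What setup(context) of SitemapMapper could do: override
    // http.content.limit for the sitemap job only.
    static void raiseContentLimit(Properties conf) {
        int limit = getInt(conf, SITEMAP_LIMIT_KEY, MAX_SITEMAP_BYTES);
        conf.setProperty("http.content.limit", Integer.toString(limit));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("http.content.limit", "65536"); // 64 KB default
        raiseContentLimit(conf);
        System.out.println(conf.getProperty("http.content.limit"));
    }
}
```

With this shape, a crawl against slow hosts could set the hypothetical property 
to a smaller value, while the default covers the full 50 MB. 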

> SitemapProcessor limited by http.content.limit
> ----------------------------------------------
>
>                 Key: NUTCH-2511
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2511
>             Project: Nutch
>          Issue Type: Bug
>          Components: sitemap
>    Affects Versions: 1.14
>            Reporter: Yossi Tamari
>            Priority: Major
>             Fix For: 1.16
>
>
> Because SitemapProcessor uses the HTTP protocol plugin, which limits the size 
> of a response to http.content.limit (64KB by default), it can only handle 
> sitemaps smaller than that size. 
> I don't believe that is what users intend when they set http.content.limit: 
> they want to limit document size, not sitemap size. The spec specifically 
> says that sitemaps can be up to 50MB.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
