[ 
https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409710#comment-16409710
 ] 

Sebastian Nagel commented on NUTCH-1741:
----------------------------------------

There are a couple of reasons (but these apply mostly to the solution in 1.x 
where sitemap processing resembles the inject step):
 * Sitemaps are allowed to be up to [50 MB in 
size|https://www.sitemaps.org/protocol.html#index]; common settings for 
http.content.limit and network timeouts do not fit this size. It would require 
larger changes to make these settings local to pages / individual fetches.
 * You may want to process sitemap indexes recursively without waiting for the 
next cycle, or you may need to limit the maximum number of URLs added. 
Sitemaps can flood the crawler with many URLs. Theoretically, a single sitemap 
index and its sub-sitemaps may contain 2.5 billion URLs. Although I haven't 
seen that many, even 10 million URLs may choke your crawler. I'm facing this 
effect from time to time with Storm-crawler [crawling 
news|http://commoncrawl.org/2016/10/news-dataset-available/]: either you have 
to wait several weeks (if your crawler is polite) until these 10 million URLs 
from a single host/domain are fetched, or you have to delete them from CrawlDb 
again.
 * URLs from sitemaps provide some extra information (publication date, score).
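
A rough back-of-the-envelope check of the numbers above, as a minimal sketch. 
It assumes the sitemaps.org limit of 50,000 entries per sitemap and per sitemap 
index, and a crawler that sends one request per fixed delay to a single host. 
The function names and the delay values are illustrative assumptions, not 
Nutch settings:

```python
# Limits from the sitemaps.org protocol: at most 50,000 entries per file,
# both for a sitemap index (sitemaps) and for a sitemap (URLs).
MAX_SITEMAPS_PER_INDEX = 50_000
MAX_URLS_PER_SITEMAP = 50_000
SECONDS_PER_DAY = 86_400


def max_urls_per_index() -> int:
    """Upper bound on URLs reachable from one fully used sitemap index."""
    return MAX_SITEMAPS_PER_INDEX * MAX_URLS_PER_SITEMAP


def days_to_fetch(num_urls: int, delay_seconds: float) -> float:
    """Days to fetch num_urls from one host at one request per delay.

    Ignores download time and retries, so it is a lower bound.
    """
    return num_urls * delay_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    print(max_urls_per_index())  # 2500000000
    # Hypothetical per-host delays in seconds, for illustration only.
    for delay in (0.25, 1.0, 5.0):
        print(delay, round(days_to_fetch(10_000_000, delay), 1))
```

Even with a short per-host delay, 10 million URLs from one host keep a polite 
crawler busy for weeks, which is why a cap on added URLs matters.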

Of course, you could also solve all these points when "inlining" the sitemap 
crawling, but it would require some changes in the architecture.
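
To illustrate the recursive processing with a limit on added URLs, here is a 
minimal sketch in Python. It is not how Nutch or StormCrawler implement it: 
collect_urls, its parameters, and the fetch callback (which returns the XML of 
a sitemap as a string) are hypothetical names introduced for this example.

```python
import xml.etree.ElementTree as ET
from collections import deque
from typing import Callable, List

# XML namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_urls(sitemap_url: str,
                 fetch: Callable[[str], str],
                 max_urls: int,
                 max_depth: int = 3) -> List[str]:
    """Expand a sitemap or sitemap index, stopping after max_urls URLs.

    fetch is a caller-supplied function returning the sitemap XML;
    max_depth guards against deeply nested or cyclic indexes.
    """
    urls: List[str] = []
    seen = set()
    queue = deque([(sitemap_url, 0)])
    while queue and len(urls) < max_urls:
        url, depth = queue.popleft()
        if url in seen or depth > max_depth:
            continue
        seen.add(url)
        root = ET.fromstring(fetch(url))
        if root.tag == NS + "sitemapindex":
            # Queue sub-sitemaps instead of waiting for a later cycle.
            for loc in root.iter(NS + "loc"):
                if loc.text:
                    queue.append((loc.text.strip(), depth + 1))
        elif root.tag == NS + "urlset":
            for loc in root.iter(NS + "loc"):
                if len(urls) >= max_urls:
                    break  # cap reached: drop the remaining URLs
                if loc.text:
                    urls.append(loc.text.strip())
    return urls
```

The cap is applied across the whole index, so a single host cannot add more 
than max_urls entries in one pass, however many sub-sitemaps it lists.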

> Support of Sitemaps in Nutch 2.x
> --------------------------------
>
>                 Key: NUTCH-1741
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1741
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, generator
>            Reporter: Alparslan Avcı
>            Assignee: Cihad Guzel
>            Priority: Major
>              Labels: gsoc2015
>             Fix For: 2.4
>
>         Attachments: NUTCH-1741-v2.patch, NUTCH-1741-v3.patch, 
> NUTCH-1741-v4.patch, NUTCH-1741-webpage-avsc.patch, NUTCH-1741.patch, 
> NUTCH-1741v5.patch, NUTCH-1741v6.patch, NUTCH-1741v7.patch, 
> SitemapCrawlerLifeCycle.pdf, SitemapDevelopmentFor2x.pdf
>
>
> Sitemap support has to be implemented for 2.x branch. It is being discussed 
> in NUTCH-1465 for trunk. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
