[
https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409710#comment-16409710
]
Sebastian Nagel commented on NUTCH-1741:
----------------------------------------
There are a couple of reasons (but these apply mostly to the solution in 1.x
where sitemap processing resembles the inject step):
* sitemaps are allowed to be up to [50 MB in
size|https://www.sitemaps.org/protocol.html#index], and common settings for
http.content.limit and network timeouts do not fit this size. It would require
larger changes to make these settings local to pages / individual fetches.
* you may want to process sitemap indexes recursively without waiting for the
next cycle, or you may need to keep a limit on the max. number of added URLs.
Sitemaps can flood the crawler with many URLs. Theoretically, a single sitemap
index and its sub-sitemaps may contain 2.5 billion URLs (50,000 sitemaps with
50,000 URLs each). Although I haven't seen that many, even 10 million URLs may
choke your crawler. I run into this effect from time to time with
Storm-crawler [crawling
news|http://commoncrawl.org/2016/10/news-dataset-available/]: either you
wait several weeks (if your crawler is polite) until these 10 million URLs
from a single host/domain are fetched, or you have to delete them from the
CrawlDb again.
* URLs from sitemaps provide some extra information (publication date, score).
Of course, you could also solve all these points when "inlining" the sitemap
crawling, but it would require some changes in the architecture.
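
To make the flooding concern concrete, here is a rough Python sketch (not Nutch
code) of the arithmetic and of a cap on added URLs. The names days_to_fetch,
collect_urls, fetch_sitemap and max_urls are hypothetical and only serve the
illustration; fetch_sitemap stands in for whatever downloads and parses a
sitemap, and the fetch delay is an assumed value.

```python
from collections import deque

# sitemaps.org limit: at most 50,000 entries per sitemap and per sitemap index.
MAX_ENTRIES_PER_FILE = 50_000

# Theoretical upper bound for one sitemap index and its sub-sitemaps.
MAX_URLS_PER_INDEX = MAX_ENTRIES_PER_FILE * MAX_ENTRIES_PER_FILE  # 2.5 billion


def days_to_fetch(num_urls, delay_seconds):
    """Days needed to fetch num_urls from one host, one request per delay."""
    return num_urls * delay_seconds / 86_400


def collect_urls(root, fetch_sitemap, max_urls):
    """Expand a sitemap index breadth-first, keeping at most max_urls page URLs.

    fetch_sitemap(url) is a hypothetical callback that returns a pair
    (kind, entries), where kind is "index" or "urlset".
    """
    urls = []
    seen = {root}
    queue = deque([root])
    while queue and len(urls) < max_urls:
        kind, entries = fetch_sitemap(queue.popleft())
        if kind == "index":
            # Queue sub-sitemaps right away instead of waiting for a next cycle.
            for entry in entries:
                if entry not in seen:
                    seen.add(entry)
                    queue.append(entry)
        else:
            # Take only as many URLs as the remaining budget allows.
            urls.extend(entries[: max_urls - len(urls)])
    return urls
```

With an assumed delay of 0.25 seconds per request, days_to_fetch(10_000_000,
0.25) comes out at roughly 29 days for a single host. The cap in collect_urls
is only one possible policy; it keeps the first URLs found and drops the rest.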
> Support of Sitemaps in Nutch 2.x
> --------------------------------
>
> Key: NUTCH-1741
> URL: https://issues.apache.org/jira/browse/NUTCH-1741
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher, generator
> Reporter: Alparslan Avcı
> Assignee: Cihad Guzel
> Priority: Major
> Labels: gsoc2015
> Fix For: 2.4
>
> Attachments: NUTCH-1741-v2.patch, NUTCH-1741-v3.patch,
> NUTCH-1741-v4.patch, NUTCH-1741-webpage-avsc.patch, NUTCH-1741.patch,
> NUTCH-1741v5.patch, NUTCH-1741v6.patch, NUTCH-1741v7.patch,
> SitemapCrawlerLifeCycle.pdf, SitemapDevelopmentFor2x.pdf
>
>
> Sitemap support has to be implemented for 2.x branch. It is being discussed
> in NUTCH-1465 for trunk.