[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564274#comment-13564274
 ] 

Sebastian Nagel commented on NUTCH-1465:
----------------------------------------

Hi Tejas,
thanks and a few comments on the patch:

“??for a given host, sitemaps are processed just once??” But they are not 
cached across cycles because the cache is bound to the protocol object. Is this 
correct? So a sitemap is fetched and processed in every cycle for every host? If 
so, and sitemaps are large (they can be!), this would cause a lot of extra traffic.

Shouldn't sitemap URLs be handled the same way as any other URL: add them to 
CrawlDb, fetch and parse once, add found links to CrawlDb? Cf. [Ken's post at 
CC|https://groups.google.com/forum/?fromgroups#!topic/crawler-commons/DrAX4Th1A4I].
There are some complications:
- due to their size, sitemaps may require larger size and time limits
- sitemaps may require more frequent re-fetching (e.g. by 
MimeAdaptiveFetchSchedule)
- the current Outlink class cannot hold the extra information contained in 
sitemaps (lastmod, changefreq, etc.)

There is another way, which we use for several customers: a SitemapInjector 
fetches the sitemaps, extracts the URLs and injects them with all the extra 
information. It's a simple use case for a customized site search: there is a 
sitemap and it is to be used as the seed list or even as the exclusive list of 
documents to be crawled. Is there any interest in this solution? It's not a 
general solution and not adaptable to a large web crawl.
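
To illustrate only the extraction step described above, here is a minimal 
sketch. It assumes a plain <urlset> sitemap that has already been fetched into 
a string. SitemapSketch, Entry and parse are hypothetical names made up for 
this illustration; this is neither the SitemapInjector mentioned above nor any 
Nutch API, and it uses nothing but the JDK's DOM parser. Sitemap index files, 
gzip compression and size limits are ignored.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical sketch: not the SitemapInjector and not part of Nutch.
class SitemapSketch {

  /** One <url> element of a sitemap; optional fields are null when absent. */
  static final class Entry {
    final String loc;
    final String lastmod;
    final String changefreq;
    final String priority;

    Entry(String loc, String lastmod, String changefreq, String priority) {
      this.loc = loc;
      this.lastmod = lastmod;
      this.changefreq = changefreq;
      this.priority = priority;
    }
  }

  /** Parses a <urlset> document and returns one Entry per <url> with a <loc>. */
  static List<Entry> parse(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      // Sitemaps come from untrusted hosts, so refuse DOCTYPE declarations.
      factory.setFeature(
          "http://apache.org/xml/features/disallow-doctype-decl", true);
      Document doc = factory.newDocumentBuilder().parse(
          new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

      List<Entry> entries = new ArrayList<>();
      NodeList urls = doc.getElementsByTagNameNS("*", "url");
      for (int i = 0; i < urls.getLength(); i++) {
        Element url = (Element) urls.item(i);
        String loc = childText(url, "loc");
        if (loc == null || loc.isEmpty()) {
          continue; // an entry without a location cannot be injected
        }
        entries.add(new Entry(loc, childText(url, "lastmod"),
            childText(url, "changefreq"), childText(url, "priority")));
      }
      return entries;
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalArgumentException("not a parsable sitemap", e);
    }
  }

  /** Trimmed text of the first descendant with this local name, or null. */
  private static String childText(Element parent, String localName) {
    NodeList nodes = parent.getElementsByTagNameNS("*", localName);
    return nodes.getLength() == 0
        ? null : nodes.item(0).getTextContent().trim();
  }
}
```

Each Entry keeps lastmod and changefreq next to the URL, which is the extra 
information that the current Outlink class cannot carry. How these values 
would be mapped onto CrawlDb metadata or a fetch schedule is left open here.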

                
> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.7
>
>         Attachments: NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
