[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564274#comment-13564274 ]
Sebastian Nagel commented on NUTCH-1465:
----------------------------------------
Hi Tejas,
thanks and a few comments on the patch:
“??for a given host, sitemaps are processed just once??” But they are not
cached across cycles because the cache is bound to the protocol object. Is this
correct? So a sitemap is fetched and processed in every cycle for every host? If
so, and sitemaps are large (they can be!), this would cause a lot of extra traffic.
Shouldn't sitemap URLs be handled the same way as any other URL: add them to
CrawlDb, fetch and parse once, and add the found links to CrawlDb? Cf. [Ken's post at
CC|https://groups.google.com/forum/?fromgroups#!topic/crawler-commons/DrAX4Th1A4I].
There are some complications:
- due to their size, sitemaps may require larger size and time limits
- sitemaps may require more frequent re-fetching (e.g. by
MimeAdaptiveFetchSchedule)
- the current Outlink class cannot hold extra information contained in sitemaps
(lastmod, changefreq, etc.)
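To make the last point concrete, here is a minimal sketch of a value object that could carry the extra fields. SitemapEntry and suggestedIntervalSeconds are hypothetical names, not existing Nutch classes, and the mapping from changefreq to seconds is only an assumption (a month is taken as 30 days, a year as 365).

```java
import java.util.Locale;

/** Hypothetical holder for one sitemap <url> entry; not part of Nutch. */
final class SitemapEntry {
    final String loc;        // page URL, the only required field
    final String lastmod;    // W3C datetime string, or null if absent
    final String changefreq; // e.g. "daily", or null if absent
    final double priority;   // 0.0 to 1.0; the sitemap protocol default is 0.5

    SitemapEntry(String loc, String lastmod, String changefreq, double priority) {
        this.loc = loc;
        this.lastmod = lastmod;
        this.changefreq = changefreq;
        this.priority = priority;
    }

    /**
     * Assumed mapping from changefreq to a re-fetch interval in seconds.
     * Returns -1 for "always", "never", missing or unknown values, meaning
     * "no hint, keep the crawler's default interval".
     */
    long suggestedIntervalSeconds() {
        if (changefreq == null) {
            return -1L;
        }
        switch (changefreq.toLowerCase(Locale.ROOT)) {
            case "hourly":  return 3600L;
            case "daily":   return 86400L;
            case "weekly":  return 7L * 86400L;
            case "monthly": return 30L * 86400L;
            case "yearly":  return 365L * 86400L;
            default:        return -1L;
        }
    }
}
```

Whether such hints should override the configured fetch schedule or only serve as an initial value is a separate design question.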
There is another way, which we use for several customers: a SitemapInjector
fetches the sitemaps, extracts the URLs and injects them with all the extra
information. It's a simple use case for a customized site search: there is a
sitemap and it shall be used as a seed list or even as the exclusive list of
documents to be crawled. Is there any interest in this solution? It's not a
general solution and not adaptable to a large web crawl.
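The extraction step of such an injector could look roughly like the sketch below. It is not the actual SitemapInjector code: SitemapSeedSketch, toSeedLines and the "sitemap." metadata key prefix are made-up names, and it assumes the output is a seed line of the form URL followed by tab-separated key=value pairs. Fetching, gzip handling and sitemap index files are left out.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper: turns sitemap XML into seed lines with metadata. */
final class SitemapSeedSketch {

    /** Trimmed text of the first descendant element named tag, or null if absent or empty. */
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        if (nodes.getLength() == 0) {
            return null;
        }
        String text = nodes.item(0).getTextContent().trim();
        return text.isEmpty() ? null : text;
    }

    /** One line per <url> entry: loc, then tab-separated sitemap.key=value pairs. */
    static List<String> toSeedLines(String sitemapXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Sitemaps come from remote hosts, so refuse DOCTYPE declarations.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(sitemapXml)));

            List<String> lines = new ArrayList<String>();
            NodeList urls = doc.getElementsByTagName("url");
            for (int i = 0; i < urls.getLength(); i++) {
                Element url = (Element) urls.item(i);
                String loc = childText(url, "loc");
                if (loc == null) {
                    continue; // skip entries without a location
                }
                StringBuilder line = new StringBuilder(loc);
                for (String key : new String[] {"lastmod", "changefreq", "priority"}) {
                    String value = childText(url, key);
                    if (value != null) {
                        line.append('\t').append("sitemap.").append(key).append('=').append(value);
                    }
                }
                lines.add(line.toString());
            }
            return lines;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("not a parseable sitemap", e);
        }
    }
}
```

For a site search with a single sitemap, the resulting lines could be written to a seed file and passed to the regular injector. For a large web crawl this approach has the limitations mentioned above.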
> Support sitemaps in Nutch
> -------------------------
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Lewis John McGibbney
> Assignee: Tejas Patil
> Fix For: 1.7
>
> Attachments: NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0
> licensed and appears to have been used successfully to parse sitemaps as per
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira