[
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564768#comment-13564768
]
Sebastian Nagel commented on NUTCH-1465:
----------------------------------------
Yes, SitemapInjector is a map-reduce job. The scenario for its use is the
following:
- a small set of sites to be crawled (e.g., to feed a site-search index)
- you can think of sitemaps as "remote seed lists". Because many content
management systems can generate sitemaps, it is convenient for site owners
to publish seeds this way. The URLs contained in the sitemap can also be the
complete and exclusive set of URLs to be crawled (you can use the plugin
scoring-depth to limit the crawl to seed URLs).
- because you can trust the sitemap's content
-* checks for "cross submissions" are not necessary
-* extra information (lastmod, changefreq, priority) can be used
That's how we use sitemaps: as remote seed lists, maintained by customers,
which is quite convenient if you run a crawler as a service.
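To illustrate the extra information a trusted sitemap carries, here is a
minimal sketch in Java that reads loc, lastmod, changefreq and priority from a
sitemap using only the JDK's DOM parser. The names SitemapSeedSketch and
SeedEntry are hypothetical and are not part of Nutch, the SitemapInjector, or
crawler-commons; the fallback priority of 0.5 is the default given by the
sitemaps.org protocol.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical sketch: turns a sitemap into "remote seed list" entries. */
final class SitemapSeedSketch {

  /** Hypothetical holder for one <url> entry; not a Nutch class. */
  static final class SeedEntry {
    final String loc;
    final String lastmod;     // null if the sitemap omits it
    final String changefreq;  // null if the sitemap omits it
    final float priority;     // 0.5 if the sitemap omits it

    SeedEntry(String loc, String lastmod, String changefreq, float priority) {
      this.loc = loc;
      this.lastmod = lastmod;
      this.changefreq = changefreq;
      this.priority = priority;
    }
  }

  /** Parses sitemap XML (assumed to be UTF-8 text) into seed entries. */
  static List<SeedEntry> parse(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      // Refuse DOCTYPE declarations, since the content comes from a remote site.
      factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
      Document doc = factory.newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

      List<SeedEntry> entries = new ArrayList<>();
      NodeList urls = doc.getElementsByTagNameNS("*", "url");
      for (int i = 0; i < urls.getLength(); i++) {
        Element url = (Element) urls.item(i);
        String loc = childText(url, "loc");
        if (loc == null) {
          continue; // an entry without a location is useless as a seed
        }
        entries.add(new SeedEntry(loc, childText(url, "lastmod"),
            childText(url, "changefreq"), parsePriority(childText(url, "priority"))));
      }
      return entries;
    } catch (Exception e) {
      throw new IllegalArgumentException("Not a parseable sitemap", e);
    }
  }

  /** Returns the trimmed text of the first matching descendant, or null. */
  private static String childText(Element parent, String localName) {
    NodeList nodes = parent.getElementsByTagNameNS("*", localName);
    if (nodes.getLength() == 0) {
      return null;
    }
    String text = nodes.item(0).getTextContent().trim();
    return text.isEmpty() ? null : text;
  }

  /** Falls back to the protocol default when the value is missing or invalid. */
  private static float parsePriority(String value) {
    if (value == null) {
      return 0.5f;
    }
    try {
      return Float.parseFloat(value);
    } catch (NumberFormatException e) {
      return 0.5f;
    }
  }
}
```

The sketch skips any cross-submission check on purpose, matching the trusted
scenario above. A real job would also have to handle sitemap index files and
compressed sitemaps, which are left out here.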
For large web crawls there is another aspect: detection of sitemaps, which
is bound to the processing of robots.txt. Processing of sitemaps can (and
should?) be done the usual Nutch way:
- detection is done in the protocol plugin (see Tejas' patch)
- record in CrawlDb: done by Fetcher (cross submission information can be added)
- fetch (if not yet done), parse (a plugin parse-sitemap based on
crawler-commons?) and extract outlinks: sitemaps may require special treatment
here because they can be large and usually contain many outlinks. Also,
the Outlink class needs to be extended to carry the extra info relevant for
scheduling.
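The detection step can be pictured with a small sketch: robots.txt may carry
"Sitemap:" lines, and collecting them while robots.txt is processed yields the
sitemap URLs to record. The class RobotsSitemapSketch below is hypothetical,
written only for illustration; it is not the code from Tejas' patch and not
the Nutch or crawler-commons robots parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch: collects sitemap URLs announced in a robots.txt body. */
final class RobotsSitemapSketch {

  static List<String> sitemapUrls(String robotsTxt) {
    List<String> result = new ArrayList<>();
    for (String rawLine : robotsTxt.split("\\r?\\n")) {
      // Drop trailing comments. Simplification: this would also cut a URL
      // that contains '#', which sitemap locations normally do not.
      int hash = rawLine.indexOf('#');
      String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();

      int colon = line.indexOf(':');
      if (colon < 0) {
        continue; // not a "field: value" line
      }
      // Field names are matched case-insensitively.
      String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
      String value = line.substring(colon + 1).trim();
      if (field.equals("sitemap") && !value.isEmpty()) {
        result.add(value);
      }
    }
    return result;
  }
}
```

Sitemap lines are independent of any User-agent group, so the sketch ignores
grouping entirely. Each returned URL could then be recorded in the CrawlDb
together with the host whose robots.txt announced it, which is the cross
submission information mentioned above.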
Using an extra tool (such as the SitemapInjector) to process the sitemaps has
the disadvantage that we must first get all sitemap URLs out of the CrawlDb. On
the other hand, special treatment can easily be realized in a separate
map-reduce job.
Comments?!
Thanks, Tejas: the feature is moving forward thanks to your initiative!
> Support sitemaps in Nutch
> -------------------------
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Lewis John McGibbney
> Assignee: Tejas Patil
> Fix For: 1.7
>
> Attachments: NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0
> licensed and appears to have been used successfully to parse sitemaps as per
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html