[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564768#comment-13564768 ]
Sebastian Nagel commented on NUTCH-1465:
----------------------------------------

Yes, SitemapInjector is a map-reduce job. The scenario for its use is the following:
- a small set of sites to be crawled (e.g., to feed a site-search index)
- you can think of sitemaps as "remote seed lists". Because many content management systems can generate sitemaps, it is convenient for the site owners to publish seeds. The URLs contained in the sitemap can also be the complete and exclusive set of URLs to be crawled (you can use the plugin scoring-depth to limit the crawl to seed URLs).
- because you can trust the sitemap's content
-* checks for "cross submissions" are not necessary
-* extra information (lastmod, changefreq, priority) can be used

That's how we use sitemaps: remote seed lists, maintained by customers, quite convenient if you run a crawler as a service.

For large web crawls there is also another aspect: detection of sitemaps, which is bound to the processing of robots.txt. Processing of sitemaps can (and should?) be done the usual Nutch way:
- detection is done in the protocol plugin (see Tejas' patch)
- record in CrawlDb: done by Fetcher (cross submission information can be added)
- fetch (if not yet done), parse (a plugin parse-sitemap based on crawler-commons?) and extract outlinks: sitemaps may require special treatment here because they can be large in size and usually contain many outlinks. Also, the Outlink class needs to be extended to deal with the extra info relevant for scheduling.

Using an extra tool (such as the SitemapInjector) to process the sitemaps has the disadvantage that we must first get all sitemap URLs out of the CrawlDb. On the other hand, special treatment can easily be realized in a separate map-reduce job.

Comments?!

Thanks, Tejas: the feature is moving forward thanks to your initiative!
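To illustrate the "remote seed list" idea, here is a minimal sketch in Python of how the URLs and the extra information (lastmod, changefreq, priority) could be read from a sitemap. It is not the SitemapInjector code and not part of any patch on this issue: the function name sitemap_to_seeds, the sample document and the example.com URLs are made up for illustration. Only the XML namespace and the element names come from the sitemaps.org protocol, which also gives 0.5 as the default priority.

```python
import xml.etree.ElementTree as ET

# Namespace of the sitemaps.org protocol, version 0.9.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap, used only for this illustration.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2013-01-20</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.com/about.html</loc>
  </url>
</urlset>
"""


def sitemap_to_seeds(xml_text):
    """Return one dict per <url> entry, treating the sitemap as a seed list.

    Hypothetical helper: optional fields that are missing are reported as
    None, except priority, which falls back to the protocol default of 0.5.
    """
    root = ET.fromstring(xml_text.strip())
    seeds = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default=None, namespaces=NS)
        if not loc:
            # An entry without <loc> cannot be used as a seed.
            continue
        priority = url.findtext("sm:priority", default="0.5", namespaces=NS)
        seeds.append({
            "url": loc.strip(),
            "lastmod": url.findtext("sm:lastmod", default=None, namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", default=None, namespaces=NS),
            "priority": float(priority),
        })
    return seeds


if __name__ == "__main__":
    for seed in sitemap_to_seeds(SAMPLE_SITEMAP):
        print(seed)
```

In a real crawl the extra fields would have to be mapped onto scheduling metadata (fetch interval, score); how that mapping should look is exactly the open question about extending the Outlink class, so it is left out of the sketch.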
> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.7
>
>         Attachments: NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira