[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

Sebastian Nagel (JIRA) Mon, 28 Jan 2013 14:31:14 -0800

    [ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564768#comment-13564768 ]


Sebastian Nagel commented on NUTCH-1465:
----------------------------------------



Yes, SitemapInjector is a map-reduce job. The scenario for its use is the 
following:
- a small set of sites to be crawled (e.g., to feed a site-search index)
- you can think of sitemaps as "remote seed lists". Because many content 
management systems can generate sitemaps it is convenient for the site owners 
to publish seeds. The URLs contained in the sitemap can also be the complete 
and exclusive set of URLs to be crawled (you can use the plugin scoring-depth 
to limit the crawl to seed URLs).
- because you can trust the sitemap's content:
-* checks for "cross submissions" are not necessary
-* extra information (lastmod, changefreq, priority) can be used
That's how we use sitemaps: remote seed lists, maintained by customers, quite 
convenient if you run a crawler as a service.
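
To illustrate the "remote seed list" idea, here is a minimal sketch of reading 
the seed URLs and the extra information (lastmod, changefreq, priority) from a 
sitemap. It uses only the JDK's XML parser and is not the SitemapInjector code; 
the class name SitemapSeedSketch and its Entry type are hypothetical. It 
assumes a plain <urlset> sitemap in the standard sitemap namespace (no sitemap 
index, no gzip).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical sketch: turn a sitemap into a list of seed entries. */
public class SitemapSeedSketch {

    /** Namespace defined by the sitemap protocol. */
    static final String NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    /** One <url> element; every field except loc is optional. */
    public static final class Entry {
        public final String loc;
        public final String lastmod;    // null if absent
        public final String changefreq; // null if absent
        public final float priority;    // protocol default 0.5 if absent

        Entry(String loc, String lastmod, String changefreq, float priority) {
            this.loc = loc;
            this.lastmod = lastmod;
            this.changefreq = changefreq;
            this.priority = priority;
        }
    }

    public static List<Entry> parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // The content comes from a remote site, so enable secure processing.
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        Document doc = factory.newDocumentBuilder().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        List<Entry> entries = new ArrayList<>();
        NodeList urls = doc.getElementsByTagNameNS(NS, "url");
        for (int i = 0; i < urls.getLength(); i++) {
            Element url = (Element) urls.item(i);
            String loc = text(url, "loc");
            if (loc == null || loc.isEmpty()) {
                continue; // an entry without a location cannot be a seed
            }
            String prio = text(url, "priority");
            entries.add(new Entry(loc, text(url, "lastmod"),
                text(url, "changefreq"),
                prio == null ? 0.5f : Float.parseFloat(prio)));
        }
        return entries;
    }

    /** Text of the first child element with the given name, or null. */
    private static String text(Element parent, String localName) {
        NodeList nodes = parent.getElementsByTagNameNS(NS, localName);
        return nodes.getLength() == 0
            ? null : nodes.item(0).getTextContent().trim();
    }
}
```

In a real injector the lastmod, changefreq and priority values would still 
have to be mapped to fetch time, fetch interval and score; that mapping is 
left out here.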

For large web crawls there is also another aspect: detection of sitemaps, which 
is tied to the processing of robots.txt. Processing of sitemaps can (and should?) 
be done the usual Nutch way:
- detection is done in the protocol plugin (see Tejas' patch)
- record in CrawlDb: done by Fetcher (cross submission information can be added)
- fetch (if not yet done), parse (a plugin parse-sitemap based on 
crawler-commons?) and extract outlinks: sitemaps may require special treatment 
here because they can be large in size and usually contain many outlinks. Also 
the Outlink class needs to be extended to deal with the extra info relevant for 
scheduling
Using an extra tool (such as the SitemapInjector) to process the sitemaps has 
the disadvantage that we must first get all sitemap URLs out of the CrawlDb. On 
the other hand, special treatment can easily be realized in a separate 
map-reduce job.
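
The detection step mentioned above (finding sitemaps while processing 
robots.txt) could look roughly like the following. This is only a sketch under 
my own assumptions, not the code from Tejas' patch and not the crawler-commons 
parser: the class name RobotsSitemapSketch is hypothetical, and it simply 
collects the values of all "Sitemap:" lines, which apply independently of any 
User-agent group.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: collect sitemap URLs announced in a robots.txt. */
public class RobotsSitemapSketch {

    public static List<String> sitemapUrls(String robotsTxt) {
        List<String> urls = new ArrayList<>();
        // \R matches any line break (\n, \r\n, \r).
        for (String line : robotsTxt.split("\\R")) {
            // Drop trailing comments. Simplification: a sitemap URL that
            // contains '#' would be cut off here.
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "field: value" line
            }
            String field = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            // The field name is matched case-insensitively.
            if (field.equalsIgnoreCase("sitemap") && !value.isEmpty()) {
                urls.add(value);
            }
        }
        return urls;
    }
}
```

The URLs found this way would then be recorded in the CrawlDb and fetched and 
parsed like any other document, as outlined in the list above.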

Comments?!

Thanks, Tejas: the feature is moving forward thanks to your initiative!
                
> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.7
>
>         Attachments: NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

