[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564601#comment-13564601
 ] 

Tejas Patil commented on NUTCH-1465:
------------------------------------

Hi Sebastian,

By “for a given host, sitemaps are processed just once” I meant that within a 
single round, the processing is done just once for a given host. I agree with 
you that a sitemap is fetched and processed in every cycle for every host. The 
SitemapInjector idea is good.

The way I see this, "SitemapInjector" will be:
- A separate map-reduce job.
- Responsible for fetching sitemaps and merging those urls with the crawldb.
- Run only periodically. For large web crawls, we don't want to run this job in 
every Nutch cycle. However, new hosts will be discovered along the way, and 
their sitemaps need to be added to the crawldb. So have a "sitemapFrequency" 
param to the crawl script. E.g. if sitemapFrequency=10, the sitemap job will be 
invoked once every 10 cycles of the Nutch crawl (1st cycle, 11th cycle, 21st 
cycle and so on).
- Runnable by users in standalone fashion on a crawldb.
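
To make the proposed schedule concrete, here is a minimal sketch of the 
"sitemapFrequency" check. It is written in Python purely for illustration and 
is not the crawl script or the patch; the function names 
should_run_sitemap_job and sitemap_cycles are hypothetical, and cycles are 
assumed to be numbered from 1.

```python
def should_run_sitemap_job(cycle: int, sitemap_frequency: int) -> bool:
    """Return True if the sitemap job should run in this crawl cycle.

    Assumption: cycles are counted from 1, so the job runs in cycle 1 and
    then once every `sitemap_frequency` cycles after that.
    """
    if cycle < 1:
        raise ValueError("cycle must be >= 1")
    if sitemap_frequency < 1:
        raise ValueError("sitemap_frequency must be >= 1")
    return (cycle - 1) % sitemap_frequency == 0


def sitemap_cycles(total_cycles: int, sitemap_frequency: int) -> list[int]:
    """List the cycles (1-based) in which the sitemap job would be invoked."""
    return [
        cycle
        for cycle in range(1, total_cycles + 1)
        if should_run_sitemap_job(cycle, sitemap_frequency)
    ]


if __name__ == "__main__":
    # With sitemapFrequency=10 over 30 cycles this prints [1, 11, 21].
    print(sitemap_cycles(30, 10))
```

With sitemap_frequency=1 the job runs in every cycle, which matches the 
current behaviour of processing sitemaps each round.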

What do you say?
                
> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.7
>
>         Attachments: NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
