[ 
https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alparslan Avcı updated NUTCH-1741:
----------------------------------

    Attachment: SitemapDevelopmentFor2x.pdf

I have uploaded a drawing that explains how I want to implement this 
improvement. I read all the comments and examined the patches in NUTCH-1465 
before preparing it. I propose to fetch sitemaps and their content within a 
crawl sprint. As the drawing shows, two new jobs (_SitemapInjectorJob_ 
and _SitemapParserJob_) have to be implemented. This approach is based on 
option "(B) Have separate job for the sitemap stuff and merge its output into 
the crawldb." mentioned in NUTCH-1465, but has some differences.
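
To make the parsing step more concrete, here is a minimal sketch of what the 
core of a sitemap parser could look like. It only assumes the public 
sitemaps.org XML format; the class name _SitemapSketch_ and its methods are 
hypothetical and are not taken from the attached drawing or from any Nutch 
API. It tells a sitemap index apart from a URL set and collects the <loc> 
entries, which is roughly the information a _SitemapParserJob_ would need in 
order to hand new URLs to the next round.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of Nutch.
final class SitemapSketch {

  // Namespace defined by the sitemaps.org protocol.
  static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

  // Parses sitemap XML into a DOM tree; parse errors become IllegalArgumentException.
  static Document parse(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      // Refuse DOCTYPE declarations, since sitemaps come from untrusted hosts.
      factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
      return factory.newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      throw new IllegalArgumentException("Not a parsable sitemap document", e);
    }
  }

  // True for a <sitemapindex> root (its <loc> entries point to further sitemaps),
  // false for a <urlset> root (its <loc> entries point to pages).
  static boolean isSitemapIndex(Document doc) {
    return "sitemapindex".equals(doc.getDocumentElement().getLocalName());
  }

  // Collects the text of every <loc> element in the sitemap namespace.
  static List<String> extractLocs(Document doc) {
    List<String> locs = new ArrayList<>();
    NodeList nodes = doc.getElementsByTagNameNS(SITEMAP_NS, "loc");
    for (int i = 0; i < nodes.getLength(); i++) {
      locs.add(nodes.item(i).getTextContent().trim());
    }
    return locs;
  }
}
```

In such a sketch, the entries of a sitemap index would be treated as further 
sitemap URLs to fetch, while the entries of a URL set would be treated as 
ordinary pages. How the real jobs store and merge these URLs is left to the 
design in the drawing.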

Pros for this approach:
* Separates the jobs according to their tasks and yields cleaner code
* Reuses some already implemented code
* Takes advantage of standard _FetcherJob_'s multi-threaded implementation and 
adaptive fetch schedule for sitemaps
* Parses the sitemap document with plugins (e.g. XML, RSS, plain text)
* Gives control of sitemap crawling to the users

Cons for this approach:
* A number of new implementations are needed
* Some small changes are also needed in the standard _GeneratorJob_

Please feel free to comment on this issue and my approach. And 
thanks to the people who contributed to NUTCH-1465.

> Support of Sitemaps in Nutch 2.x
> --------------------------------
>
>                 Key: NUTCH-1741
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1741
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, generator
>            Reporter: Alparslan Avcı
>             Fix For: 2.3
>
>         Attachments: SitemapDevelopmentFor2x.pdf
>
>
> Sitemap support has to be implemented for 2.x branch. It is being discussed 
> in NUTCH-1465 for trunk. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
