[
https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alparslan Avcı updated NUTCH-1741:
----------------------------------
Attachment: SitemapDevelopmentFor2x.pdf
I have uploaded a drawing that explains how I want to implement this
improvement. I read all the comments and examined the patches in NUTCH-1465
before preparing it. I propose to fetch sitemaps and their content as part of a
crawl cycle. As you can see in the drawing, two new jobs (_SitemapInjectorJob_
and _SitemapParserJob_) have to be implemented. This approach is based on the "(B)
Have separate job for the sitemap stuff and merge its output into the crawldb."
option mentioned in NUTCH-1465, but has some differences.
Pros for this approach:
* Separates the jobs according to their tasks, which gives cleaner code
* Re-uses some already implemented code
* Takes advantage of standard _FetcherJob_'s multi-threaded implementation and
adaptive fetch schedule for sitemaps
* Parses sitemap documents with format-specific plugins such as XML, RSS, and plain text
* Gives control of sitemap crawling to the users
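To make the parsing point above more concrete, here is a minimal sketch of what extracting URLs from an XML sitemap involves. It is not taken from the attached drawing or from any patch in NUTCH-1465. The class name _SitemapSketch_ and the method _extractLocs_ are hypothetical, and it uses only the JDK's DOM parser rather than any Nutch plugin API. It relies on the fact that, in the sitemaps.org format, both a urlset and a sitemapindex carry their URLs in loc elements.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; not part of Nutch.
public class SitemapSketch {

  /** Returns the trimmed text of every <loc> element in a sitemap or sitemap index. */
  public static List<String> extractLocs(String xml) throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    // Sitemaps come from untrusted hosts, so refuse DOCTYPE declarations.
    factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document doc = builder.parse(
        new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

    // "*" matches <loc> in any namespace, including the sitemaps.org one.
    NodeList locs = doc.getElementsByTagNameNS("*", "loc");
    List<String> urls = new ArrayList<>();
    for (int i = 0; i < locs.getLength(); i++) {
      String url = locs.item(i).getTextContent().trim();
      if (!url.isEmpty()) {
        urls.add(url);
      }
    }
    return urls;
  }

  public static void main(String[] args) throws Exception {
    String xml =
        "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
            + "<url><loc>http://example.com/page1</loc></url>"
            + "<url><loc>http://example.com/page2</loc></url>"
            + "</urlset>";
    // Print one extracted URL per line.
    for (String url : extractLocs(xml)) {
      System.out.println(url);
    }
  }
}
```

In the proposed design this kind of logic would presumably live in a parse plugin, so that the RSS and plain-text variants can be handled the same way.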
Cons for this approach:
* Several new implementations are needed
* Some small changes are also needed in the standard _GeneratorJob_
Please feel free to comment on this issue and my approach. And
thanks to the people who contributed to NUTCH-1465.
> Support of Sitemaps in Nutch 2.x
> --------------------------------
>
> Key: NUTCH-1741
> URL: https://issues.apache.org/jira/browse/NUTCH-1741
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher, generator
> Reporter: Alparslan Avcı
> Fix For: 2.3
>
> Attachments: SitemapDevelopmentFor2x.pdf
>
>
> Sitemap support has to be implemented for the 2.x branch. It is being discussed
> in NUTCH-1465 for trunk.