[ 
https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954983#comment-13954983
 ] 

Alparslan Avcı commented on NUTCH-1741:
---------------------------------------

Hi [~wastl-nagel],

Thank you for your comments. I have written my responses to your questions below.

* _"takes advantage of standard FetcherJob ..."_
** _what about sitemap indexes? They can't be fetched in one turn, yet, cannot 
be hold in one web table row because a sitemap index has multiple URLs._
>> Sitemaps listed in a sitemap index will be processed in two crawl rounds. 
>> First, the sitemap URLs in the sitemap index will be put into the table; 
>> second, the new URLs found in those sitemaps will be put into the table.
** _do we really need queues and politeness when fetching only sitemaps? 
There's rarely more than one sitemap per host._
>> I agree that the majority of hosts have no more than one sitemap. 
>> However, some sites with frequently changing content (like e-commerce 
>> sites) have many sitemaps and sitemap indexes. IMHO, we have to implement 
>> a solution for these sites in order to get their new URLs as soon as possible.
** _"adaptive fetch schedule for sitemaps": that's an interesting idea, it may 
help in case of forgotten and hopelessly outdated sitemaps. But isn't a sitemap 
more like robots.txt? – only cached for a short time and re-fetched within 
short periods because a fresh sitemap may contain fresh links_
>> Actually, this is something we can only learn from experience. As you said, 
>> the sitemap crawler has to run at short intervals for frequently updated 
>> sitemaps. But IMHO this crawler does not need to re-fetch outdated 
>> sitemaps.
* _"SitemapParserJob": that's a combination of parser + updatedb, right?_
>> Yes, that is right.
* _"Parses the sitemap document with plugins like XML, RSS, plain text."_
** _Does it mean these plugin(s) has/have to be written?_
>> Yes, that is also right. The first plugin will be based on crawler-commons.
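
To make the two-round handling of sitemap indexes described above more concrete, here is a minimal toy simulation in Python. It is only a sketch under stated assumptions: the in-memory FAKE_WEB dictionary, the example.com URLs, the row fields and the function names are all hypothetical and are not Nutch or crawler-commons APIs. It only shows that URLs listed in a sitemap index reach the table in the first round, and URLs listed in those sitemaps reach it in the second.

```python
# Toy stand-in for the web: url -> (document kind, URLs listed in it).
# All URLs and names here are hypothetical, for illustration only.
FAKE_WEB = {
    "http://example.com/sitemap_index.xml": ("index", [
        "http://example.com/sitemap-a.xml",
        "http://example.com/sitemap-b.xml",
    ]),
    "http://example.com/sitemap-a.xml": ("sitemap", [
        "http://example.com/page1.html",
        "http://example.com/page2.html",
    ]),
    "http://example.com/sitemap-b.xml": ("sitemap", [
        "http://example.com/page3.html",
    ]),
}


def new_table(seed_url):
    """Create a toy 'web table' holding one unfetched sitemap row."""
    return {seed_url: {"is_sitemap": True, "fetched": False}}


def crawl_round(table):
    """Fetch and parse every unfetched sitemap row once.

    Returns the number of sitemap documents processed in this round.
    """
    pending = [url for url, row in table.items()
               if row["is_sitemap"] and not row["fetched"]]
    for url in pending:
        kind, links = FAKE_WEB[url]
        table[url]["fetched"] = True
        for link in links:
            # Entries of a sitemap index are sitemaps themselves and must
            # wait for the next round; entries of a sitemap are pages.
            table.setdefault(
                link, {"is_sitemap": kind == "index", "fetched": False})
    return len(pending)


if __name__ == "__main__":
    table = new_table("http://example.com/sitemap_index.xml")
    crawl_round(table)  # round 1: sitemap URLs from the index enter the table
    crawl_round(table)  # round 2: page URLs from those sitemaps enter the table
    for url in sorted(table):
        print(url, table[url])
```

In this sketch each sitemap row simply waits for the next round, so no single row ever has to hold the contents of more than one document.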

> Support of Sitemaps in Nutch 2.x
> --------------------------------
>
>                 Key: NUTCH-1741
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1741
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, generator
>            Reporter: Alparslan Avcı
>             Fix For: 2.3
>
>         Attachments: SitemapDevelopmentFor2x.pdf
>
>
> Sitemap support has to be implemented for the 2.x branch. It is being 
> discussed in NUTCH-1465 for trunk. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
