[
https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954983#comment-13954983
]
Alparslan Avcı commented on NUTCH-1741:
---------------------------------------
Hi [~wastl-nagel],
Thank you for your comments. My answers to your questions are below.
* _"takes advantage of standard FetcherJob ..."_
** _what about sitemap indexes? They can't be fetched in one turn, yet cannot
be held in one web table row because a sitemap index has multiple URLs._
>> Sitemaps listed in sitemap indexes will be processed in two crawl rounds.
>> In the first round, the sitemap URLs in a sitemap index will be put into the
>> table; in the second, the new URLs found in those sitemaps will be put into
>> the table.
** _do we really need queues and politeness when fetching only sitemaps?
There's rarely more than one sitemap per host._
>> I agree that the majority of hosts have no more than one sitemap.
>> However, some sites with frequently changing content (like e-commerce
>> sites) have lots of sitemaps and sitemap indexes. IMHO, we have to implement
>> a solution for these sites in order to get their new URLs asap.
** _"adaptive fetch schedule for sitemaps": that's an interesting idea, it may
help in case of forgotten and hopelessly outdated sitemaps. But isn't a sitemap
more like robots.txt? – only cached for a short time and re-fetched within
short periods because a fresh sitemap may contain fresh links_
>> Actually, this is something we can only learn from experience. As you said,
>> the sitemap crawler has to be run at short intervals for frequently updated
>> sitemaps. But IMHO this crawler does not need to re-fetch outdated
>> sitemaps.
* _"SitemapParserJob": that's a combination of parser + updatedb, right?_
>> Yes, that is right.
* _"Parses the sitemap document with plugins like XML, RSS, plain text."_
** _Does it mean these plugin(s) has/have to be written?_
>> Yes, that is also right. The first plugin will be based on crawler-commons.
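
To illustrate the two-round handling of sitemap indexes described above, here is a minimal sketch in Python. It is not Nutch code and does not use crawler-commons. It assumes the standard sitemaps.org XML namespace; the function names `parse_sitemap` and `crawl_round`, the dict standing in for the web table, and the example.com URLs are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text):
    """Return (kind, locs), where kind is 'index' or 'urlset'."""
    root = ET.fromstring(xml_text)
    if root.tag == NS + "sitemapindex":
        kind, entry = "index", "sitemap"
    elif root.tag == NS + "urlset":
        kind, entry = "urlset", "url"
    else:
        raise ValueError("not a sitemap document: " + root.tag)
    locs = []
    for element in root.findall(NS + entry):
        loc = element.find(NS + "loc")
        if loc is not None and loc.text:
            locs.append(loc.text.strip())
    return kind, locs


def crawl_round(table, fetched):
    """Parse the sitemaps fetched in this round and add new rows.

    table   : URL -> {"is_sitemap": bool, "parsed": bool}
              (hypothetical stand-in for the web table)
    fetched : URL -> XML text of the sitemap rows fetched this round
    """
    for url, xml_text in fetched.items():
        kind, locs = parse_sitemap(xml_text)
        table[url]["parsed"] = True
        for loc in locs:
            # Entries of an index are sitemaps themselves; entries of a
            # urlset are ordinary pages. Existing rows are left untouched.
            table.setdefault(loc, {"is_sitemap": kind == "index", "parsed": False})
    return table


INDEX_XML = (
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<sitemap><loc>http://example.com/sitemap-a.xml</loc></sitemap>"
    "</sitemapindex>"
)
PAGES_XML = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>http://example.com/page1.html</loc></url>"
    "<url><loc>http://example.com/page2.html</loc></url>"
    "</urlset>"
)

# The table starts with only the sitemap index row.
table = {"http://example.com/sitemap-index.xml": {"is_sitemap": True, "parsed": False}}

# Round 1: the index is parsed and its sitemap URLs are added to the table.
crawl_round(table, {"http://example.com/sitemap-index.xml": INDEX_XML})

# Round 2: the newly added sitemap is parsed and its page URLs are added.
crawl_round(table, {"http://example.com/sitemap-a.xml": PAGES_XML})
```

In this sketch each sitemap occupies its own row, so an index with many entries never has to fit into a single row; the page URLs simply arrive one round later than the sitemap URLs.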
> Support of Sitemaps in Nutch 2.x
> --------------------------------
>
> Key: NUTCH-1741
> URL: https://issues.apache.org/jira/browse/NUTCH-1741
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher, generator
> Reporter: Alparslan Avcı
> Fix For: 2.3
>
> Attachments: SitemapDevelopmentFor2x.pdf
>
>
> Sitemap support has to be implemented for 2.x branch. It is being discussed
> in NUTCH-1465 for trunk.