[
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886453#comment-13886453
]
Sebastian Nagel commented on NUTCH-1465:
----------------------------------------
Thanks, [~tejasp], for the improvements! Testing continued...
Sitemaps are treated the same as ordinary URLs/docs. But there are some
differences. Shouldn't we relax the default limits and filters and trust the
restrictions specified in the sitemap protocol?
* URL filters and normalizers: maybe you want to exclude .gz docs by suffix
filter but still fetch gzipped sitemaps. That's currently not possible. Is it
really necessary to normalize/filter sitemap URLs? If yes, this should be optional.
* the default content limits {http,ftp,file}.content.limit (64 kB) are quite
small even for mid-size sitemaps. Ok, you could set them per {{-D...}} but why
not increase them to SiteMapParser.MAX_BYTES_ALLOWED?
* maybe we also want to increase the fetch timeout
Processing sitemap indexes fails:
* the check sitemap.isIndex() skips all referenced sitemaps
* the protocol of the sitemap index and the referenced sub-sitemaps may differ
(e.g., one sub-sitemap could be https while the others are http)
* if processing one of the referenced sitemaps fails, the remaining
sub-sitemaps are not processed
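To illustrate the last two points, here is a minimal sketch of how the loop over sub-sitemaps could tolerate failures. It is not taken from the patch: {{SitemapLoader}} and {{collectUrls}} are hypothetical names, and the loader stands in for whatever fetches and parses a single sitemap by its own URL (so an https sub-sitemap under an http index is handled per URL).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SitemapIndexSketch {

    /** Hypothetical stand-in: fetches and parses one sitemap, returns the URLs it lists. */
    interface SitemapLoader {
        List<String> load(String sitemapUrl) throws Exception;
    }

    /**
     * Processes every sub-sitemap referenced by an index. A failure in one
     * sub-sitemap is recorded in 'failed' and the loop continues with the rest.
     */
    static List<String> collectUrls(List<String> subSitemaps,
                                    SitemapLoader loader,
                                    List<String> failed) {
        List<String> urls = new ArrayList<>();
        for (String sub : subSitemaps) {
            try {
                // Each sub-sitemap is loaded by its own URL, whatever its scheme.
                urls.addAll(loader.load(sub));
            } catch (Exception e) {
                // Remember the failure, but do not abort the remaining sub-sitemaps.
                failed.add(sub);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> failed = new ArrayList<>();
        // Fake loader for demonstration only: one sub-sitemap fails.
        SitemapLoader fake = url -> {
            if (url.endsWith("broken.xml")) {
                throw new java.io.IOException("simulated fetch failure");
            }
            return Arrays.asList(url + "#page1", url + "#page2");
        };
        List<String> urls = collectUrls(
            Arrays.asList("http://example.com/s1.xml",
                          "https://example.com/broken.xml",
                          "http://example.com/s3.xml"),
            fake, failed);
        System.out.println(urls.size() + " URLs collected, failed: " + failed);
    }
}
```

Whether failed sub-sitemaps should be retried or only logged is left open here.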
Fetch intervals are taken unchecked from <changefreq>. Should we limit them to
reasonable values (db.fetch.schedule.adaptive.min_interval <= interval <=
db.fetch.interval.max)? Fetch intervals of 1 second or 1 hour may cause
trouble. [[1|http://www.sitemaps.org/protocol.html#xmlTagDefinitions]]
explicitly says that <changefreq> "is considered a hint and not a command".
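A rough sketch of the clamping I have in mind. The mapping from <changefreq> values to seconds and the names {{ChangeFreqIntervalSketch}} and {{clampedInterval}} are assumptions for illustration only, not code from the patch. The min/max arguments would come from db.fetch.schedule.adaptive.min_interval and db.fetch.interval.max; the numbers used in main() are just example values.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class ChangeFreqIntervalSketch {

    // Illustrative mapping from <changefreq> values to seconds (an assumption).
    static final Map<String, Long> SECONDS = new HashMap<>();
    static {
        SECONDS.put("always", 1L);
        SECONDS.put("hourly", 3600L);
        SECONDS.put("daily", 86400L);
        SECONDS.put("weekly", 7L * 86400L);
        SECONDS.put("monthly", 30L * 86400L);
        SECONDS.put("yearly", 365L * 86400L);
        SECONDS.put("never", Long.MAX_VALUE);
    }

    /**
     * Treats <changefreq> as a hint: looks up the hinted interval, falls back
     * to 'fallback' for unknown values, and clamps the result to [min, max].
     */
    static long clampedInterval(String changeFreq, long min, long max, long fallback) {
        Long hinted = SECONDS.get(changeFreq.toLowerCase(Locale.ROOT));
        long interval = (hinted == null) ? fallback : hinted;
        return Math.max(min, Math.min(max, interval));
    }

    public static void main(String[] args) {
        long min = 60L;            // example value for the minimum interval
        long max = 90L * 86400L;   // example value for the maximum interval
        long fallback = 30L * 86400L;
        for (String freq : new String[] {"always", "hourly", "daily", "yearly", "never"}) {
            System.out.println(freq + " -> " + clampedInterval(freq, min, max, fallback));
        }
    }
}
```

With these example bounds, "always" would be raised to the minimum and "yearly"/"never" lowered to the maximum, while "hourly" and "daily" pass through unchanged.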
> Support sitemaps in Nutch
> -------------------------
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Lewis John McGibbney
> Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch,
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch,
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch,
> NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0
> licensed and appears to have been used successfully to parse sitemaps as per
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html