[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886453#comment-13886453
 ] 

Sebastian Nagel commented on NUTCH-1465:
----------------------------------------

Thanks, [~tejasp], for the improvements! Testing continued...

Sitemaps are treated the same as ordinary URLs/docs, but there are some 
differences. Shouldn't we relax the default limits and filters and trust the 
restrictions specified in the sitemap protocol?
* URL filters and normalizers: you may want to exclude .gz docs via the suffix 
filter but still fetch gzipped sitemaps. That's currently not possible. Is it 
really necessary to normalize/filter sitemap URLs? If yes, this should be 
optional.
* the default content limits {http,ftp,file}.content.limit (64 kB) are quite 
small even for mid-size sitemaps. OK, you could set them per {{-D...}}, but why 
not increase them to SiteMapParser.MAX_BYTES_ALLOWED?
* maybe we also want to increase the fetch timeout

Processing sitemap indexes fails:
* the check sitemap.isIndex() skips all referenced sitemaps
* the protocol for the sitemap index and referenced sub-sitemaps may differ (e.g., 
one sub-sitemap could be https while others are http)
* if processing one of the referenced sitemaps fails, the remaining 
sub-sitemaps are not processed
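
To illustrate the last point, here is a minimal, self-contained sketch of how each referenced sub-sitemap could be processed in isolation, so that one failure does not abort the rest. It does not use the actual Nutch or SiteMapParser API: the class {{SitemapIndexWalk}}, the method {{processIndex}} and the {{fetchAndParse}} callback are hypothetical names chosen for illustration only. Each sub-sitemap is handed over by its full URL, so the callback is free to pick the protocol (http or https) per sub-sitemap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class SitemapIndexWalk {

    /** Hypothetical holder for the outcome of walking one sitemap index. */
    public static final class Result {
        public final List<String> urls = new ArrayList<>();
        public final List<String> failed = new ArrayList<>();
    }

    /**
     * Processes every sub-sitemap referenced by an index.
     * fetchAndParse is an assumed callback that fetches one sitemap by its
     * full URL and returns the page URLs it lists; it may throw on failure.
     */
    public static Result processIndex(List<String> subSitemaps,
                                      Function<String, List<String>> fetchAndParse) {
        Result result = new Result();
        for (String sitemapUrl : subSitemaps) {
            try {
                result.urls.addAll(fetchAndParse.apply(sitemapUrl));
            } catch (RuntimeException e) {
                // Record the failure and continue with the remaining sub-sitemaps.
                result.failed.add(sitemapUrl);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> index = List.of(
            "http://example.com/sitemap-1.xml",
            "https://example.com/sitemap-2.xml",
            "http://example.com/sitemap-3.xml");

        // Stand-in for real fetching/parsing: the second sub-sitemap fails.
        Result r = processIndex(index, url -> {
            if (url.endsWith("sitemap-2.xml")) {
                throw new IllegalStateException("fetch failed: " + url);
            }
            return List.of(url.replace(".xml", "/page.html"));
        });

        System.out.println("URLs:   " + r.urls);
        System.out.println("Failed: " + r.failed);
    }
}
```

Whether failed sub-sitemaps should be retried or only logged is left open here.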

Fetch intervals are taken unchecked from <changefreq>. Should we limit them to 
reasonable values (db.fetch.schedule.adaptive.min_interval <= interval <= 
db.fetch.interval.max)? Fetch intervals of 1 second or 1 hour may cause 
trouble. [[1|http://www.sitemaps.org/protocol.html#xmlTagDefinitions]] 
explicitly says that <changefreq> "is considered a hint and not a command".
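
A minimal sketch of the proposed clamping follows. It is not taken from the patch: the class {{FetchIntervalClamp}}, its methods, and the mapping from <changefreq> values to seconds are assumptions made for illustration, and the min/max values in {{main}} merely stand in for whatever db.fetch.schedule.adaptive.min_interval and db.fetch.interval.max are configured to.

```java
import java.util.Locale;
import java.util.Map;

public class FetchIntervalClamp {

    // Assumed mapping from <changefreq> values to seconds (illustrative only).
    // "never" and unknown values are left unmapped and fall back to the default.
    static final Map<String, Long> CHANGEFREQ_SECONDS = Map.of(
        "always", 1L,
        "hourly", 3600L,
        "daily", 86400L,
        "weekly", 604800L,
        "monthly", 2592000L,
        "yearly", 31536000L);

    /** Restricts interval to the range [min, max]. */
    public static long clamp(long interval, long min, long max) {
        return Math.max(min, Math.min(max, interval));
    }

    /**
     * Treats <changefreq> as a hint: looks up the hinted interval, falls back
     * to defaultInterval if there is none, then clamps the result.
     */
    public static long intervalFor(String changefreq, long defaultInterval,
                                   long min, long max) {
        Long hinted = (changefreq == null)
            ? null
            : CHANGEFREQ_SECONDS.get(changefreq.toLowerCase(Locale.ROOT));
        long interval = (hinted == null) ? defaultInterval : hinted;
        return clamp(interval, min, max);
    }

    public static void main(String[] args) {
        long min = 60L;           // stand-in for db.fetch.schedule.adaptive.min_interval
        long max = 7776000L;      // stand-in for db.fetch.interval.max (90 days)
        long dflt = 2592000L;     // stand-in for the default fetch interval (30 days)

        for (String freq : new String[] {"always", "hourly", "daily", "yearly", "never"}) {
            System.out.println(freq + " -> " + intervalFor(freq, dflt, min, max) + " s");
        }
    }
}
```

With these stand-in bounds, "always" is raised to the minimum and "yearly" is lowered to the maximum, while values already inside the range pass through unchanged.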


> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.8
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, 
> NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
