[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886453#comment-13886453 ]
Sebastian Nagel commented on NUTCH-1465:
----------------------------------------

Thanks, [~tejasp] for the improvements! Testing continued...

Sitemaps are treated the same as ordinary URLs/docs. But there are some differences. Shouldn't we relax default limits and filters and trust the restrictions specified in the sitemap protocol?
* URL filters and normalizers: maybe you want to exclude .gz docs per suffix filter but still fetch gzipped sitemaps. That's not possible. Is it really necessary to normalize/filter sitemap URLs? If yes, this should be optional.
* default content limits {http,ftp,file}.content.limit (64 kB) are quite small even for mid-size sitemaps. Ok, you could set it per {{-D...}} but why not increase it to SiteMapParser.MAX_BYTES_ALLOWED?
* maybe we also want to increase the fetch timeout

Processing sitemap indexes fails:
* the check sitemap.isIndex() skips all referenced sitemaps
* protocol for sitemap index and referenced sub-sitemaps may be different (e.g., one sub-sitemap could be https while others are http)
* if processing one of the referenced sitemaps fails, the remaining sub-sitemaps are not processed

Fetch intervals are taken unchecked from <changefreq>. Should we limit them to reasonable values (db.fetch.schedule.adaptive.min_interval <= interval <= db.fetch.interval.max)? Fetch intervals of 1 second or 1 hour may cause trouble. [[1|http://www.sitemaps.org/protocol.html#xmlTagDefinitions]] explicitly says that <changefreq> "is considered a hint and not a command".
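
To make the proposed interval check concrete, here is a minimal sketch of the clamping, not taken from any of the attached patches. The class and method names (FetchIntervalClamp, changeFreqToSeconds, clamp) are hypothetical, the <changefreq>-to-seconds mapping is an assumption for illustration, and the bounds in the usage line stand in for whatever db.fetch.schedule.adaptive.min_interval and db.fetch.interval.max are configured to.

```java
import java.util.Locale;

// Hypothetical helper, for illustration only; not part of the Nutch code base.
final class FetchIntervalClamp {

    private FetchIntervalClamp() {}

    /**
     * Maps a sitemap <changefreq> value to a nominal interval in seconds.
     * The numbers are an assumed mapping; a missing or unknown value
     * yields the supplied fallback.
     */
    static int changeFreqToSeconds(String changeFreq, int fallbackSeconds) {
        if (changeFreq == null) {
            return fallbackSeconds;
        }
        switch (changeFreq.trim().toLowerCase(Locale.ROOT)) {
            case "always":  return 1;
            case "hourly":  return 3600;
            case "daily":   return 86400;
            case "weekly":  return 7 * 86400;
            case "monthly": return 30 * 86400;
            case "yearly":  return 365 * 86400;
            case "never":   return Integer.MAX_VALUE; // reduced to max by clamp()
            default:        return fallbackSeconds;
        }
    }

    /** Restricts the interval to min <= interval <= max (all in seconds). */
    static int clamp(int intervalSeconds, int minSeconds, int maxSeconds) {
        return Math.max(minSeconds, Math.min(maxSeconds, intervalSeconds));
    }
}
```

With example bounds of 60 seconds and 7776000 seconds (90 days), clamp(changeFreqToSeconds("always", 2592000), 60, 7776000) yields 60 instead of 1, so <changefreq> stays a hint and cannot push the schedule outside the configured range.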

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.8
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, 
> NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)