[ https://issues.apache.org/jira/browse/NUTCH-2509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16404889#comment-16404889 ]
Sebastian Nagel commented on NUTCH-2509:
----------------------------------------

Thanks, [~yossi]! The redir issue is already fixed in NUTCH-2521 (sorry, I hadn't seen this issue). I've also found that the URLs of sitemaps referenced in the robots.txt are not filtered/normalized. I'll open a PR to address this as well.

> Inconsistent behavior in SitemapProcessor
> -----------------------------------------
>
>                 Key: NUTCH-2509
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2509
>             Project: Nutch
>          Issue Type: Bug
>          Components: sitemap
>    Affects Versions: 1.14
>            Reporter: Yossi Tamari
>            Priority: Minor
>             Fix For: 1.15
>
>         Attachments: SitemapProcessor.patch
>
>
> There are two inconsistent behaviors in SitemapProcessor:
> # There is a member variable maxRedir that is supposed to limit the number
> of redirections on sitemap URLs, and it is initialized from config property
> sitemap.redir.max, but it is ignored in the code because a local variable
> with the same name is defined in the relevant method, and is always set to 3.
> # When a sitemap URL goes through a redirect, it is filtered and normalized.
> However, if a sitemap URL comes from a sitemapindex, it is not. This seems
> inconsistent, as in both cases we have a URL from an outside source.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
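
For readers unfamiliar with the first point in the issue description, the following is a minimal, self-contained sketch of how a local variable can shadow a configured member. It is not the actual SitemapProcessor code: the class name ShadowingSketch, the methods buggyLimit and fixedLimit, the plain Map standing in for the configuration object, and the default of 3 are all assumptions made for illustration. Only the names maxRedir and sitemap.redir.max come from the issue.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical illustration only; not taken from the Nutch source.
public class ShadowingSketch {

    // Member meant to hold the configured redirect limit.
    private final int maxRedir;

    public ShadowingSketch(Map<String, String> conf) {
        // Property name is from the issue; the fallback of 3 is an assumption.
        this.maxRedir = Integer.parseInt(conf.getOrDefault("sitemap.redir.max", "3"));
    }

    // Buggy pattern: the local declaration shadows the member,
    // so the configured value is never consulted.
    public int buggyLimit() {
        int maxRedir = 3;
        return maxRedir;
    }

    // Fixed pattern: no local redeclaration, so the member is used.
    public int fixedLimit() {
        return maxRedir;
    }

    public static void main(String[] args) {
        ShadowingSketch s =
            new ShadowingSketch(Collections.singletonMap("sitemap.redir.max", "5"));
        System.out.println("buggy limit: " + s.buggyLimit());
        System.out.println("fixed limit: " + s.fixedLimit());
    }
}
```

With the property set to 5, the buggy method still reports 3 while the fixed method reports the configured value, which matches the behavior described in the report.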