[ 
https://issues.apache.org/jira/browse/NUTCH-2509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16404889#comment-16404889
 ] 

Sebastian Nagel commented on NUTCH-2509:
----------------------------------------

Thanks, [~yossi]! The redirect issue is already fixed in NUTCH-2521 (sorry, I 
hadn't seen this issue). I've also found that the URLs of sitemaps referenced 
in the robots.txt are not filtered/normalized. I'll open a PR to address this as 
well.
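
As a rough sketch of the idea (not the actual Nutch code): every sitemap URL 
that comes from an outside source, whether from a redirect, a sitemapindex, or 
robots.txt, could pass through one shared normalize-then-filter step. The class 
name SitemapUrlGate, the toy lowercasing normalizer, and the example.org prefix 
filter are all assumptions made for illustration only.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical helper, not part of Nutch: one entry point that normalizes and
// then filters any externally supplied sitemap URL.
class SitemapUrlGate {
    private final UnaryOperator<String> normalizer;
    private final Predicate<String> filter;

    SitemapUrlGate(UnaryOperator<String> normalizer, Predicate<String> filter) {
        this.normalizer = normalizer;
        this.filter = filter;
    }

    // Returns the normalized URL if it passes the filter, otherwise empty.
    Optional<String> accept(String url) {
        if (url == null) {
            return Optional.empty();
        }
        String normalized = normalizer.apply(url.trim());
        if (normalized == null || !filter.test(normalized)) {
            return Optional.empty();
        }
        return Optional.of(normalized);
    }

    public static void main(String[] args) {
        // Toy normalizer and filter, chosen only to keep the sketch self-contained.
        SitemapUrlGate gate = new SitemapUrlGate(
            u -> u.toLowerCase(Locale.ROOT),
            u -> u.startsWith("https://example.org/"));

        // The same call would be used for redirect targets, sitemapindex
        // entries, and robots.txt entries alike.
        System.out.println(gate.accept(" HTTPS://EXAMPLE.ORG/sitemap.xml "));
        System.out.println(gate.accept("https://other.example/sitemap.xml"));
    }
}
```

Routing all three sources through the same call is what keeps the behavior 
consistent; the real normalizers and filters would come from the crawler's 
configuration.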

> Inconsistent behavior in SitemapProcessor
> -----------------------------------------
>
>                 Key: NUTCH-2509
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2509
>             Project: Nutch
>          Issue Type: Bug
>          Components: sitemap
>    Affects Versions: 1.14
>            Reporter: Yossi Tamari
>            Priority: Minor
>             Fix For: 1.15
>
>         Attachments: SitemapProcessor.patch
>
>
> There are two inconsistent behaviors in SitemapProcessor:
>  # There is a member variable maxRedir that is supposed to limit the number 
> of redirections on sitemap URLs, and it is initialized from config property 
> sitemap.redir.max, but it is ignored in the code because a local variable 
> with the same name is defined in the relevant method, and is always set to 3.
>  # When a sitemap URL goes through a redirect, it is filtered and normalized. 
> However, if a sitemap URL comes from a sitemapindex, it is not. This seems 
> inconsistent, as in both cases we have a URL from an outside source.
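
The first point in the description above is a variable-shadowing pattern. A 
minimal sketch of it, assuming a hypothetical class RedirectLimitDemo rather 
than the real SitemapProcessor source, might look like this:

```java
// Hypothetical stand-in for the pattern described in the issue; this is not
// the actual SitemapProcessor code.
class RedirectLimitDemo {
    // Member meant to hold the configured limit (the issue names the
    // property sitemap.redir.max).
    private final int maxRedir;

    RedirectLimitDemo(int configuredMaxRedir) {
        this.maxRedir = configuredMaxRedir;
    }

    // Shadowed pattern: the local declaration hides the member, so the
    // configured value never takes effect.
    int effectiveLimitShadowed() {
        int maxRedir = 3; // shadows this.maxRedir
        return maxRedir;
    }

    // Fixed pattern: no local redeclaration, so the member is used.
    int effectiveLimitFixed() {
        return maxRedir;
    }

    public static void main(String[] args) {
        RedirectLimitDemo demo = new RedirectLimitDemo(5);
        System.out.println(demo.effectiveLimitShadowed()); // prints 3
        System.out.println(demo.effectiveLimitFixed());    // prints 5
    }
}
```

Dropping the local declaration is enough to make the configured value take 
effect in this sketch.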



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
