[ 
https://issues.apache.org/jira/browse/NUTCH-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258827#comment-13258827
 ] 

Sebastian Nagel commented on NUTCH-1339:
----------------------------------------

BasicURLNormalizer does not remove the anchor for https URLs (NUTCH-1344).
At least, in my case this was the real reason for the large number of bad URLs.

The only motivation for not removing the anchor completely is the rare case
in which anchor and query parameters are accidentally swapped.
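
To illustrate the difference between partial and complete anchor removal,
here is a minimal sketch in plain Java. The two patterns, the class name and
the example URLs are assumptions made up for this illustration; they are not
the actual Nutch rules and do not use the Nutch normalizer API.

```java
import java.util.regex.Pattern;

public class AnchorRemovalSketch {

    // Hypothetical rule: remove the anchor only up to the first '?' or '&'
    // and keep whatever follows (the behaviour described in this issue).
    private static final Pattern PARTIAL = Pattern.compile("#[^?&]*");

    // Hypothetical rule: remove everything from '#' to the end of the URL.
    private static final Pattern COMPLETE = Pattern.compile("#.*$");

    static String removePartially(String url) {
        return PARTIAL.matcher(url).replaceFirst("");
    }

    static String removeCompletely(String url) {
        return COMPLETE.matcher(url).replaceFirst("");
    }

    public static void main(String[] args) {
        // Anchor and query accidentally swapped: partial removal rescues the query.
        String swapped = "http://example.com/page#top?id=42";
        // AJAX-style state kept in the anchor: partial removal leaks the state.
        String ajax1 = "http://example.com/app#view&state=1";
        String ajax2 = "http://example.com/app#view&state=2";

        System.out.println(removePartially(swapped));   // http://example.com/page?id=42
        System.out.println(removePartially(ajax1));     // http://example.com/app&state=1
        System.out.println(removePartially(ajax2));     // http://example.com/app&state=2
        System.out.println(removeCompletely(ajax1));    // http://example.com/app
        System.out.println(removeCompletely(ajax2));    // http://example.com/app
    }
}
```

With partial removal the two AJAX URLs stay distinct although they point to
the same document; with complete removal both collapse to one URL, at the
cost of losing the query in the swapped case.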
                
> Default URL normalization rules to remove page anchors completely
> -----------------------------------------------------------------
>
>                 Key: NUTCH-1339
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1339
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: nutchgora, 1.6
>            Reporter: Sebastian Nagel
>         Attachments: NUTCH-1339-2.patch, NUTCH-1339.patch
>
>
> The default rules of URLNormalizerRegex remove the anchor only up to the
> first occurrence of ? or &. The remaining part of the anchor is kept,
> which may cause a large, possibly infinite number of outlinks when the
> same document is fetched again and again with different URLs,
> see http://www.mail-archive.com/user%40nutch.apache.org/msg05940.html
> Parameters in inner-page anchors are a common practice in AJAX web sites.
> Currently, crawling AJAX content is not supported (NUTCH-1323).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
