[ 
https://issues.apache.org/jira/browse/NUTCH-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13255936#comment-13255936
 ] 

Markus Jelsma commented on NUTCH-1339:
--------------------------------------

The anchor is still removed by the BasicURLNormalizer. We worked around the 
problem for the AJAXNormalizer by changing the normalizer order: the 
AJAXNormalizer runs first, then everything else. When indexing, however, the 
BasicURLNormalizer (if enabled) must run first and only then the AJAXNormalizer.
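
To illustrate why the order matters at crawl time, here is a minimal sketch in
Python rather than Nutch code. Both functions are hypothetical stand-ins:
basic_normalize only drops the fragment, and ajax_normalize assumes the
"#!" to "_escaped_fragment_" rewrite of the AJAX crawling scheme, which is an
assumption and not taken from the attached patches.

```python
from urllib.parse import quote, urldefrag


def basic_normalize(url):
    # Stand-in for a basic normalizer: drops the fragment (anchor) entirely.
    return urldefrag(url).url


def ajax_normalize(url):
    # Hypothetical AJAX normalizer: rewrites "#!state" into a query parameter.
    base, sep, state = url.partition("#!")
    if not sep:
        return url  # no hash-bang fragment, nothing to rewrite
    joiner = "&" if "?" in base else "?"
    return base + joiner + "_escaped_fragment_=" + quote(state, safe="")


def normalize(url, chain):
    # Apply the normalizers in the given order.
    for step in chain:
        url = step(url)
    return url


url = "http://example.com/app#!page=2"
ajax_first = normalize(url, [ajax_normalize, basic_normalize])
basic_first = normalize(url, [basic_normalize, ajax_normalize])
```

With the AJAX step first, the state survives as a query parameter. With the
basic step first, the fragment is already gone by the time the AJAX step runs,
so the AJAX state is lost.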

                
> Default URL normalization rules to remove page anchors completely
> -----------------------------------------------------------------
>
>                 Key: NUTCH-1339
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1339
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: nutchgora, 1.6
>            Reporter: Sebastian Nagel
>         Attachments: NUTCH-1339-2.patch, NUTCH-1339.patch
>
>
> The default rules of URLNormalizerRegex remove the anchor only up to the
> first occurrence of ? or &. The remaining part of the anchor is kept, which
> may cause a large, possibly infinite number of outlinks when the same
> document is fetched again and again with different URLs; see
> http://www.mail-archive.com/user%40nutch.apache.org/msg05940.html
> Parameters in inner-page anchors are a common practice in AJAX web sites.
> Currently, crawling AJAX content is not supported (NUTCH-1323).
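
The behaviour described in the quoted issue can be sketched as follows. The
two patterns are illustrative assumptions written for this example, not copies
of the rules shipped in Nutch's regex normalizer configuration, and the URLs
are made up.

```python
import re

# Illustrative patterns only (assumptions, not Nutch's shipped rules).
PARTIAL = re.compile(r"#[^?&]*")  # removes the anchor up to the first ? or &
COMPLETE = re.compile(r"#.*$")    # removes the anchor completely

# Three links to the same document that differ only inside the anchor.
urls = [
    "http://example.com/page#tab?id=1",
    "http://example.com/page#tab?id=2",
    "http://example.com/page#tab&id=3",
]

partial = sorted({PARTIAL.sub("", u) for u in urls})
complete = sorted({COMPLETE.sub("", u) for u in urls})
```

Under the partial rule each link still normalizes to a distinct URL, so the
same document would be queued three times. Removing the anchor completely
collapses all three to a single URL.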


        
