[ https://issues.apache.org/jira/browse/NUTCH-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1062:
---------------------------------

    Fix Version/s:     (was: 1.4)
                       (was: 2.0)
                   1.5
    
> Migrate BasicURLNormalizer from Apache ORO to java.util.regex
> -------------------------------------------------------------
>
>                 Key: NUTCH-1062
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1062
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.5
>
>
> This issue tracks the migration from ORO to j.u.regex. There is a small
> problem here. I began the migration mostly because of the double-slash issue:
> the fix needs lookbehind, which ORO does not support. The goal was to prevent
> the double slash after the URL scheme from being reduced to a single slash.
> The current BasicURLNormalizer has this problem built in:
> {code}
>         // this pattern tries to find spots like "xx//yy" in the url,
>         // which could be replaced by a "/"
>         adjacentSlashRule = new Rule();
>         adjacentSlashRule.pattern = (Perl5Pattern)      
>           compiler.compile("/{2,}", Perl5Compiler.READ_ONLY_MASK);     
>         adjacentSlashRule.substitution = new Perl5Substitution("/");
> {code}
> This rule gives the wrong result because it also collapses the double slash
> after the scheme. What should we do? Migrate to j.u.regex and keep this
> `feature` intact?
> Edit: reading further, it looks like this is fixed at a later stage: a slash
> is added back for the http and ftp URI schemes.
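
For illustration, here is a minimal sketch of how the adjacent-slash rule might look with j.u.regex and a negative lookbehind. It is not the committed patch: the class name AdjacentSlashSketch and the method collapseSlashes are hypothetical, and the pattern assumes it is enough to skip slash runs that directly follow a colon.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the actual BasicURLNormalizer code.
public class AdjacentSlashSketch {

    // Two or more slashes NOT directly preceded by ':'. The negative
    // lookbehind (?<!:) keeps the "//" after the scheme untouched.
    private static final Pattern ADJACENT_SLASHES =
        Pattern.compile("(?<!:)/{2,}");

    // Collapse each matched run of slashes into a single "/".
    public static String collapseSlashes(String url) {
        return ADJACENT_SLASHES.matcher(url).replaceAll("/");
    }

    public static void main(String[] args) {
        // prints "http://example.com/a/b"
        System.out.println(collapseSlashes("http://example.com//a///b"));
    }
}
```

One limitation of this sketch: a slash run that follows a colon anywhere else in the URL (for example inside the path or query string) is also left alone, so a real rule may need to anchor the lookbehind to the scheme more strictly.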
