Migrate BasicURLNormalizer from Apache ORO to java.util.regex
-------------------------------------------------------------
Key: NUTCH-1062
URL: https://issues.apache.org/jira/browse/NUTCH-1062
Project: Nutch
Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.4, 2.0
This issue tracks the migration from ORO to j.u.regex. There is a small problem
here. I began the migration mostly because of the double slash issue: fixing it
needs lookbehind, which ORO does not support. The lookbehind is meant to prevent
the slashes after the URL scheme from being reduced to one. The current
BasicURLNormalizer has this problem built in:
{code}
// this pattern tries to find spots like "xx//yy" in the url,
// which could be replaced by a "/"
adjacentSlashRule = new Rule();
adjacentSlashRule.pattern = (Perl5Pattern)
compiler.compile("/{2,}", Perl5Compiler.READ_ONLY_MASK);
adjacentSlashRule.substitution = new Perl5Substitution("/");
{code}
However, this rule gives the wrong result because it also collapses the slashes
after the scheme. What to do? Migrate to j.u.regex and keep this `feature` intact?
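
For comparison, here is a minimal sketch of both options in j.u.regex. It is not
the actual patch: the class and method names are hypothetical, and it assumes
the rule is applied to the whole URL string. The first method mirrors the
current ORO behavior; the second uses a negative lookbehind so that slashes
directly following a colon are left alone.

```java
import java.util.regex.Pattern;

// Hypothetical sketch class, not part of Nutch.
class SlashCollapseSketch {

    // Same pattern as the ORO rule: any run of two or more slashes.
    private static final Pattern ADJACENT_SLASHES = Pattern.compile("/{2,}");

    // Same run of slashes, but only when the run is not directly preceded
    // by ':' (negative lookbehind), so the "://" after the scheme survives.
    private static final Pattern ADJACENT_SLASHES_NOT_AFTER_COLON =
            Pattern.compile("(?<!:)/{2,}");

    // Mirrors the current behavior, including the collapsed scheme slashes.
    static String collapseLikeOro(String url) {
        return ADJACENT_SLASHES.matcher(url).replaceAll("/");
    }

    // Collapses repeated slashes but keeps the ones right after a colon.
    static String collapseKeepingScheme(String url) {
        return ADJACENT_SLASHES_NOT_AFTER_COLON.matcher(url).replaceAll("/");
    }

    public static void main(String[] args) {
        String url = "http://example.com//a///b";
        System.out.println(collapseLikeOro(url));       // http:/example.com/a/b
        System.out.println(collapseKeepingScheme(url)); // http://example.com/a/b
    }
}
```

Note that the lookbehind variant skips any "://" in the string, not only the one
after the scheme, so a colon followed by slashes inside the path or query would
also be left untouched. Whether that is acceptable depends on which part of the
URL the normalizer passes to the rule.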