Faster RegexNormalize with more features
----------------------------------------

                 Key: NUTCH-410
                 URL: http://issues.apache.org/jira/browse/NUTCH-410
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
    Affects Versions: 0.8
         Environment: Tested on MacOS X 10.4.7/10.4.8
            Reporter: Doug Cook
            Priority: Minor


The patch associated with this is backwards-compatible and has several 
improvements over the stock 0.8 RegexURLNormalizer:

1) Roughly a 34% performance improvement, from executing the superclass 
(BasicURLNormalizer) only once in most cases, instead of twice as the stock 
version does. 

2) Support for expensive host-specific normalizations with good performance. 
Each <regex> block optionally takes a list of hosts to which to apply the 
associated regex. If supplied, the regex will only be applied to these hosts. 
This should scale well: the host comparison is O(1) regardless of the number 
of hosts. The format is:

    <regex>
        <host>www.host1.com</host>
        <host>host2.site2.com</host>
        <pattern> my pattern here </pattern>
        <substitution> my substitution here </substitution>
    </regex>
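
To illustrate the host-scoping idea, here is a minimal sketch in Java. It is 
not the code from the patch: the class and method names (HostScopedRules, 
Rule, appliesTo, hostOf) are hypothetical, and it assumes that the host is 
taken once from the incoming URL and that a rule with no <host> entries 
applies everywhere. Keeping each rule's hosts in a hash set makes the 
per-rule host check O(1) no matter how many hosts are listed.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these names come from the patch.
class HostScopedRules {

    // One <regex> block: optional hosts, a pattern and a substitution.
    static final class Rule {
        final Set<String> hosts = new HashSet<>(); // empty = every host
        final Pattern pattern;
        final String substitution;

        Rule(String pattern, String substitution, String... hosts) {
            this.pattern = Pattern.compile(pattern);
            this.substitution = substitution;
            for (String host : hosts) {
                this.hosts.add(host.toLowerCase(Locale.ROOT));
            }
        }

        // Hash lookup, so the cost does not grow with the host list.
        boolean appliesTo(String host) {
            return hosts.isEmpty() || hosts.contains(host);
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    void add(Rule rule) {
        rules.add(rule);
    }

    // Applies the rules in order, skipping rules scoped to other hosts.
    String normalize(String url) {
        String host = hostOf(url);
        String result = url;
        for (Rule rule : rules) {
            if (rule.appliesTo(host)) {
                result = rule.pattern.matcher(result).replaceAll(rule.substitution);
            }
        }
        return result;
    }

    // Lower-cased host of the URL, or "" if it cannot be parsed.
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? "" : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return "";
        }
    }
}
```

With a rule added as new HostScopedRules.Rule("[?&]sessionid=[0-9a-f]+", "", 
"www.host1.com"), a URL on www.host1.com loses its sessionid parameter while 
the same URL on any other host is returned unchanged.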

3) Support for decoding URLs with percent-escaped characters (e.g. %20). 
This is useful, for example, to decode "jump redirects" which have the 
target URL encoded within the source, as on Yahoo. I tried to create an 
extensible notion of "options," the first of which is "unescape." The unescape 
function is applied *after* the substitution and *only* if the substitution 
pattern matches. A simple pattern to unescape Yahoo directory redirects would 
be something like:

<regex>
  <pattern>^http://[a-z\.]*\.yahoo\.com/.*/\*+(http[^&amp;]+)</pattern>
  <substitution>$1</substitution>
  <options>unescape</options>
</regex>
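
The ordering described above (substitute first, then unescape, and only when 
the pattern matched) can be sketched as follows. This is an illustration 
under stated assumptions rather than the patch itself: the class name 
UnescapeRule is hypothetical, java.net.URLDecoder with UTF-8 stands in for 
whatever decoding the patch performs, and note that URLDecoder also turns 
"+" into a space, which may or may not be wanted for URLs.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a single rule carrying the "unescape" option.
class UnescapeRule {
    private final Pattern pattern;
    private final String substitution;
    private final boolean unescape;

    UnescapeRule(String pattern, String substitution, boolean unescape) {
        this.pattern = Pattern.compile(pattern);
        this.substitution = substitution;
        this.unescape = unescape;
    }

    String apply(String url) {
        Matcher matcher = pattern.matcher(url);
        if (!matcher.find()) {
            // No match: neither substitution nor unescaping happens.
            return url;
        }
        // replaceAll resets the matcher, so the earlier find() is harmless.
        String result = matcher.replaceAll(substitution);
        if (!unescape) {
            return result;
        }
        try {
            // Assumption: UTF-8 escapes; malformed escapes throw
            // IllegalArgumentException, which a real normalizer should handle.
            return URLDecoder.decode(result, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

Built with the Yahoo pattern above (with &amp; written as a plain & in Java) 
and the substitution $1, a made-up redirect such as 
http://rds.yahoo.com/S=1/K=x/*http%3A//www.example.com/path%3Fq%3D1 would 
come out as http://www.example.com/path?q=1, while a URL that does not match 
the pattern keeps its escapes.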

4) Added the notion of iterating the pattern chain. This is useful when the 
result of a normalization can itself be normalized. While some of this can be 
handled in the stock version by repeating patterns, or by careful ordering of 
patterns, the notion of iterating is cleaner and more powerful. The chain is 
defined to iterate only when the previous iteration changes the input, up to a 
configurable maximum number of iterations. The config parameter to change is 
urlnormalizer.regex.maxiterations, which defaults to 1 (previous behavior). The 
change is performance-neutral when disabled, and has a relatively small 
performance cost when enabled.
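
A minimal sketch of the iteration rule, for illustration only: the class 
IteratedChain and its methods are hypothetical names, and the constructor 
argument plays the role of urlnormalizer.regex.maxiterations. The chain is 
re-run only while the previous pass changed the URL, and never more than the 
configured number of times, so a limit of 1 reproduces the single-pass 
behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; not the code from the patch.
class IteratedChain {
    private final List<Pattern> patterns = new ArrayList<>();
    private final List<String> substitutions = new ArrayList<>();
    private final int maxIterations;

    IteratedChain(int maxIterations) {
        // Values below 1 are treated as 1, i.e. a single pass.
        this.maxIterations = Math.max(1, maxIterations);
    }

    void add(String pattern, String substitution) {
        patterns.add(Pattern.compile(pattern));
        substitutions.add(substitution);
    }

    // One pass over every pattern, in order.
    private String applyOnce(String url) {
        String result = url;
        for (int i = 0; i < patterns.size(); i++) {
            result = patterns.get(i).matcher(result).replaceAll(substitutions.get(i));
        }
        return result;
    }

    // Repeats the chain until nothing changes or the limit is reached.
    String normalize(String url) {
        String current = url;
        for (int i = 0; i < maxIterations; i++) {
            String next = applyOnce(current);
            if (next.equals(current)) {
                break; // fixed point: a further pass would be a no-op
            }
            current = next;
        }
        return current;
    }
}
```

For example, with the made-up rule ^http://r\.example\.com/\?u=(.*)$ and the 
substitution $1, a doubly wrapped redirect is unwrapped one level when the 
limit is 1 and fully unwrapped when the limit is 2 or more.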

Pardon any potentially unconventional Java on my part. I've got lots of C/C++ 
search engine experience, but Nutch is my first large Java app. I welcome any 
feedback, and hope this is useful.

Doug


        
