Done. See http://issues.apache.org/jira/browse/NUTCH-409
This is my first Nutch contribution, so hopefully I've got it right ;-)
Any suggestions/questions/feedback welcome. Hope this is useful to others.

D

scott green wrote:
>
> Hi Doug,
>
> Your idea about PrefixURLFilter and AutomatonURLFilter combination
> sounds interesting. Could you please attach the patch to JIRA? Thanks
>
> - Scott
>
> On 11/17/06, Doug Cook <[EMAIL PROTECTED]> wrote:
>>
>> Hi, folks,
>>
>> I, too, was slowed down by reduce operations in fetch. Some
>> benchmarking showed that in my case, the limiting operation was
>> filtering (though a distant second was the time spent calculating
>> Levenshtein distances, presumably part of the spellchecking that Sami
>> just removed to speed things up, though I haven't looked at it yet).
>>
>> I've fixed the problem, and my reduce speed is better by about a
>> factor of three. However, the fix is limited to certain usage
>> patterns.
>>
>> In my case, I have tens of thousands of sites and subsites I'm
>> crawling, and I'm using a combination of PrefixURLFilter +
>> AutomatonURLFilter. I essentially use the prefix filter to limit to
>> the set of sites, and then automaton to pattern-match within those
>> sites. I only have subsite matches on < 10% of the sites, however, so
>> I was clearly wasting a lot of time running the automaton patterns
>> that didn't need it. And automaton, though much faster than
>> RegexURLFilter, is still dog-slow with that many patterns.
>>
>> A simple fix was to extend the current "AND all the filters together"
>> model to have the notion of a "short-circuit" match, which allows a
>> filter to say "let this URL through and DON'T run the other filters"
>> by returning a special token to URLFilters. Now I have a version of
>> PrefixURLFilter that can return both "normal" matches and "short
>> circuit" matches, and only returns "normal" matches for those sites
>> that need to run subsite patterns.
>>
>> It seems to work well, the overhead is negligible when not in use,
>> and the speedup is massive for my usage pattern.
>>
>> I'd like to contribute it back, if people would find this useful (not
>> that it's rocket science!).
>>
>> First, is there anyone out there besides me who would find this
>> useful?
>>
>> Second, I've been thinking about the best way to handle
>> PrefixURLFilter configuration. I can see a few options:
>>
>> 1. Have two different config files, one for "normal" matches, and one
>>    for "short-circuit" matches.
>> 2. Have one config file, with a syntax to say "make this pattern a
>>    short-circuit match," and make the default be a "normal" match, so
>>    it is backwards compatible with the current version.
>> 3. Make a new type of filter which internally combines Prefix and
>>    Automaton, takes one config file, and decides internally which
>>    patterns should generate automaton inputs vs "normal" or "short
>>    circuit" prefix matches.
>>
>> Approach #3 requires no changes to the URLFilter model, and makes it
>> difficult to screw up by making config files which are inconsistent
>> (e.g. forgetting to put in a prefix pattern for one of the automaton
>> patterns). It is also the least flexible, requires the most code, and
>> introduces yet another kind of filter.
>>
>> I tend to like the changed URLFilter model; it's more flexible, even
>> if it requires a little more care in configuration (a simple Perl
>> script, in my case, to generate the config files correctly and
>> consistently). I'm leaning towards approach #2. I'm thinking
>> something simple, syntax-wise, like putting SHORTCIRCUIT: before the
>> patterns which should short-circuit. Any suggestions for a better
>> syntax? Or reasons why I should consider a different approach?
>>
>> Doug
>>
>> --
>> View this message in context:
>> http://www.nabble.com/More-fetcher-speed-increases-tf2644170.html#a7381430
>> Sent from the Nutch - Dev mailing list archive at Nabble.com.
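
For readers skimming the thread, the short-circuit model described in the quoted message could be sketched roughly as follows. This is only an illustration, not the code attached to NUTCH-409 and not the real Nutch URLFilter interface: the names ShortCircuitDemo, Verdict, UrlFilter, PrefixFilter and accepts are all hypothetical. The SHORTCIRCUIT: marker follows the syntax proposed under approach #2; the '#' comment handling and the linear prefix scan are assumptions made to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a filter chain with short-circuit matches. */
class ShortCircuitDemo {

    /** Outcome of one filter: reject, accept and continue, or accept and stop. */
    enum Verdict { REJECT, PASS, SHORT_CIRCUIT }

    /** Stand-in for a URL filter; not the real Nutch interface. */
    interface UrlFilter {
        Verdict filter(String url);
    }

    /** Prefix filter built from config lines; "SHORTCIRCUIT:" marks short-circuit prefixes. */
    static final class PrefixFilter implements UrlFilter {
        private static final String MARKER = "SHORTCIRCUIT:";
        private final List<String> normal = new ArrayList<>();
        private final List<String> shortCircuit = new ArrayList<>();

        PrefixFilter(List<String> configLines) {
            for (String raw : configLines) {
                String line = raw.trim();
                // Assumption: blank lines and '#' comments are skipped.
                if (line.isEmpty() || line.startsWith("#")) {
                    continue;
                }
                if (line.startsWith(MARKER)) {
                    shortCircuit.add(line.substring(MARKER.length()).trim());
                } else {
                    // Unmarked lines stay "normal" matches, so old configs behave as before.
                    normal.add(line);
                }
            }
        }

        @Override
        public Verdict filter(String url) {
            // Linear scan for brevity; a production filter would likely use a trie.
            for (String p : shortCircuit) {
                if (url.startsWith(p)) return Verdict.SHORT_CIRCUIT;
            }
            for (String p : normal) {
                if (url.startsWith(p)) return Verdict.PASS;
            }
            return Verdict.REJECT;
        }
    }

    /** ANDs the filters together, but stops early when one returns SHORT_CIRCUIT. */
    static boolean accepts(List<UrlFilter> chain, String url) {
        for (UrlFilter f : chain) {
            Verdict v = f.filter(url);
            if (v == Verdict.REJECT) return false;       // any rejection is final
            if (v == Verdict.SHORT_CIRCUIT) return true; // skip the remaining filters
        }
        return true; // every filter passed
    }
}
```

With a chain of a PrefixFilter followed by a more expensive pattern filter, a URL under a SHORTCIRCUIT: prefix is accepted without the pattern filter ever running, while a URL under a normal prefix still has to pass both.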