Done. See http://issues.apache.org/jira/browse/NUTCH-409

This is my first Nutch contribution, so hopefully I've got it right ;-) Any
suggestions/questions/feedback welcome.

Hope this is useful to others.
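
For anyone skimming the thread, here is a rough sketch of the short-circuit
idea from my earlier message (quoted below). It is an illustration only, not
the code attached to the JIRA issue. The names ShortCircuitSketch, Filter,
PrefixFilter, SHORT_CIRCUIT and runChain are all made up for this example,
and the SHORTCIRCUIT: config syntax is just the one floated in that message.
It assumes the usual filter contract (return the URL to accept it, null to
reject it) and uses a plain linear scan over prefixes for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a short-circuiting filter chain; all names are assumptions.
final class ShortCircuitSketch {

    /** Assumed contract: return the URL to accept it, null to reject it. */
    interface Filter {
        String filter(String url);
    }

    /**
     * Hypothetical "accept and skip the remaining filters" token. It is a
     * distinct String instance compared by identity, so no real URL can
     * collide with it.
     */
    static final String SHORT_CIRCUIT = new String("<short-circuit>");

    /** Prefix filter whose prefixes are either "normal" or "short-circuit". */
    static final class PrefixFilter implements Filter {
        private static final String TAG = "SHORTCIRCUIT:";
        private final List<String> normal = new ArrayList<>();
        private final List<String> shortCircuit = new ArrayList<>();

        /** Untagged lines stay "normal", so existing config files keep working. */
        PrefixFilter(List<String> configLines) {
            for (String raw : configLines) {
                String line = raw.trim();
                if (line.isEmpty() || line.startsWith("#")) {
                    continue; // skip blank lines and comments
                }
                if (line.startsWith(TAG)) {
                    shortCircuit.add(line.substring(TAG.length()).trim());
                } else {
                    normal.add(line);
                }
            }
        }

        @Override
        public String filter(String url) {
            // Arbitrary choice for this sketch: short-circuit prefixes are checked first.
            for (String p : shortCircuit) {
                if (url.startsWith(p)) return SHORT_CIRCUIT;
            }
            for (String p : normal) {
                if (url.startsWith(p)) return url;
            }
            return null; // no prefix matched: reject
        }
    }

    /** ANDs the filters together, stopping early on a reject or a short-circuit. */
    static String runChain(List<Filter> filters, String url) {
        for (Filter f : filters) {
            String result = f.filter(url);
            if (result == null) return null;          // rejected
            if (result == SHORT_CIRCUIT) return url;  // accepted, skip the rest
            url = result;
        }
        return url;
    }

    public static void main(String[] args) {
        Filter prefix = new PrefixFilter(Arrays.asList(
                "SHORTCIRCUIT:http://whole-site.example/",
                "http://subsite.example/"));
        // Stand-in for the expensive pattern filter; only subsite URLs should reach it.
        Filter patterns = u -> u.contains("/docs/") ? u : null;
        List<Filter> chain = Arrays.asList(prefix, patterns);
        System.out.println(runChain(chain, "http://whole-site.example/any/page"));
        System.out.println(runChain(chain, "http://subsite.example/docs/a.html"));
        System.out.println(runChain(chain, "http://subsite.example/other/b.html"));
    }
}
```

The point of the design is that the expensive filter never sees URLs from
sites that have no subsite patterns, while untagged prefixes behave exactly
as before.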

D


scott green wrote:
> 
> Hi Doug,
> 
> Your idea about PrefixURLFilter and AutomatonURLFilter combination
> sounds interesting. Could you please attach the patch to JIRA? Thanks.
> 
> - Scott
> 
> On 11/17/06, Doug Cook <[EMAIL PROTECTED]> wrote:
>>
>> Hi, folks,
>>
>> I, too, was slowed down by reduce operations in fetch. Some benchmarking
>> showed that in my case, the limiting operation was filtering (though a
>> distant second was the time spent calculating Levenshtein distances,
>> presumably part of the spellchecking that Sami just removed to speed things
>> up, though I haven't looked at it yet).
>>
>> I've fixed the problem, and my reduce speed is better by about a factor of
>> three. However, the fix is limited to certain usage patterns.
>>
>> In my case, I have tens of thousands of sites and subsites I'm crawling, and
>> I'm using a combination of PrefixURLFilter + AutomatonURLFilter. I
>> essentially use the prefix filter to limit to the set of sites, and then
>> automaton to pattern-match within those sites. I only have subsite matches
>> on fewer than 10% of the sites, however, so I was clearly wasting a lot of
>> time running the automaton patterns on URLs that didn't need them. And
>> automaton, though much faster than RegexURLFilter, is still dog-slow with
>> that many patterns.
>>
>> A simple fix was to extend the current "AND all the filters together" model
>> to have the notion of a "short-circuit" match, which allows a filter to say
>> "let this URL through and DON'T run the other filters" by returning a
>> special token to URLFilters. Now I have a version of PrefixURLFilter that
>> can return both "normal" matches and "short-circuit" matches, and only
>> returns "normal" matches for those sites that need to run subsite patterns.
>> It seems to work well, the overhead is negligible when not in use, and the
>> speedup is massive for my usage pattern.
>>
>> I'd like to contribute it back, if people would find this useful (not that
>> it's rocket science!).
>>
>> First, is there anyone out there besides me who would find this useful?
>>
>> Second, I've been thinking about the best way to handle PrefixURLFilter
>> configuration. I can see a few options:
>>
>> 1. Have two different config files, one for "normal" matches, and one for
>> "short-circuit" matches.
>> 2. Have one config file, with a syntax to say "make this pattern a
>> short-circuit match," and make the default be a "normal" match, so it is
>> backwards compatible with the current version.
>> 3. Make a new type of filter which internally combines Prefix and Automaton,
>> takes one config file, and decides internally which patterns should
>> generate automaton inputs vs "normal" or "short-circuit" prefix matches.
>>
>> Approach #3 requires no changes to the URLFilter model, and makes it
>> difficult to screw up by making config files which are inconsistent (e.g.
>> forgetting to put in a prefix pattern for one of the automaton patterns). It
>> is also the least flexible, requires the most code, and introduces yet
>> another kind of filter.
>>
>> I tend to like the changed URLFilter model; it's more flexible, even if it
>> requires a little more care in configuration (a simple Perl script, in my
>> case, to generate the config files correctly and consistently). I'm leaning
>> towards approach #2. I'm thinking something simple, syntax-wise, like
>> putting SHORTCIRCUIT: before the patterns which should short-circuit. Any
>> suggestions for a better syntax? Or reasons why I should consider a
>> different approach?
>>
>> Doug
>>
>> --
>> View this message in context:
>> http://www.nabble.com/More-fetcher-speed-increases-tf2644170.html#a7381430
>> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/More-fetcher-speed-increases-tf2644170.html#a7543634
Sent from the Nutch - Dev mailing list archive at Nabble.com.


_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
