Hi Dogacan,

> Are you parsing during fetching? If you are, ParseOutputFormat runs during
> reduce and that may be the slow part (because without parsing, fetch-reduce
> is just an identity reduce).

I am indeed parsing, but the cause might be a very greedy regular
expression in my regexurlfilter. Andrzej very kindly suggested that I dump
a stack. I did that several times, and every time I got something like:

"main" prio=10 tid=0x08e10c00 nid=0x68ae runnable [0xb7d78000..0xb7d791f8]
   java.lang.Thread.State: RUNNABLE
    at java.lang.Character.codePointAt(Character.java:2335)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3344)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3737)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$Start.match(Pattern.java:3055)
    at java.util.regex.Matcher.search(Matcher.java:1105)
    at java.util.regex.Matcher.find(Matcher.java:535)

The regex I suspect is:

# skip URLs containing sequences of more than 20 chars
-.*[^/]{20,}.*
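
To check the suspicion outside of a fetch, here is a minimal sketch that runs the rule through Matcher.find(), the call at the bottom of the stack above. The class name, the helper methods, the test URL and the LEAN pattern are all made up for illustration; LEAN is only a guess at a cheaper rule, based on the observation that under find() the leading and trailing ".*" should not change which URLs match. Note that {20,} means 20 or more characters, not strictly more than 20.

```java
import java.util.regex.Pattern;

public class GreedyFilterSketch {
    // The suspected rule, minus the leading '-' that marks it as an exclusion.
    static final Pattern GREEDY = Pattern.compile(".*[^/]{20,}.*");

    // Hypothetical replacement: under find(), the surrounding ".*" adds nothing.
    static final Pattern LEAN = Pattern.compile("[^/]{20}");

    // Mirrors the Matcher.find() call visible in the stack trace.
    static boolean matches(Pattern p, String url) {
        return p.matcher(url).find();
    }

    // Builds a long URL made only of short path segments (made-up test data),
    // so neither pattern matches and the greedy one has to backtrack.
    static String shortSegments(int count) {
        StringBuilder sb = new StringBuilder("http://example.com");
        for (int i = 0; i < count; i++) {
            sb.append("/seg").append(i % 10);
        }
        return sb.toString();
    }

    // Wall-clock time of a single find() call; rough, not a real benchmark.
    static long timeNanos(Pattern p, String url) {
        long start = System.nanoTime();
        matches(p, url);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        String url = shortSegments(2000);
        System.out.println("greedy: " + timeNanos(GREEDY, url) + " ns");
        System.out.println("lean:   " + timeNanos(LEAN, url) + " ns");
    }
}
```

The timings are a single unwarmed call each, so they only give an order of magnitude, but a large gap on long URLs that do not match would be consistent with the stack dumps.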

I will remove it for my next iteration of fetching and we'll see what we
get. My configuration specified only one reducer by default; I'll set that
to a larger value and see what impact that has.

On a slightly different subject: is there a chance your patch 442 could be
added to the source? I am a big fan of it :-)

Thanks,

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com
