Hi Dogacan,
> Are you parsing during fetching? If you are ParseOutputFormat runs during
> reduce and that may be the slow part (because without parsing, fetch-reduce
> is just identity reduce)
>
I am indeed parsing, but the cause might be a very greedy regular
expression in my regexurlfilter. Andrzej very kindly suggested that I dump
a stack. I did that several times and every time I got something like:
"main" prio=10 tid=0x08e10c00 nid=0x68ae runnable [0xb7d78000..0xb7d791f8]
   java.lang.Thread.State: RUNNABLE
        at java.lang.Character.codePointAt(Character.java:2335)
        at java.util.regex.Pattern$CharProperty.match(Pattern.java:3344)
        at java.util.regex.Pattern$Curly.match(Pattern.java:3737)
        at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
        at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
        at java.util.regex.Pattern$Start.match(Pattern.java:3055)
        at java.util.regex.Matcher.search(Matcher.java:1105)
        at java.util.regex.Matcher.find(Matcher.java:535)
The regex I suspect is:
# skip URLs containing sequences of more than 20 chars
-.*[^/]{20,}.*
I will remove it from my next iteration of fetching and we'll see what we
get. By default my configuration specified only one reducer; I'll set
that to a larger value and see what impact that has.
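For what it's worth, here is a minimal sketch of why that rule may be costly. It assumes the filter applies the pattern with Matcher.find(), as the stack trace suggests, and that URLs contain no line terminators. Under those assumptions the leading and trailing ".*" add nothing except backtracking, and a bare "[^/]{20}" should accept and reject the same URLs. The class name, the SIMPLE pattern and the example URLs are hypothetical, not taken from Nutch.

```java
import java.util.regex.Pattern;

public class UrlFilterRegexSketch {
    // The suspected rule from my regexurlfilter, without the leading "-" sign.
    static final Pattern GREEDY = Pattern.compile(".*[^/]{20,}.*");

    // Hypothetical cheaper alternative: with find(), a run of 20 non-slash
    // characters anywhere in the URL is enough to trigger the rule.
    static final Pattern SIMPLE = Pattern.compile("[^/]{20}");

    static boolean matchesGreedy(String url) {
        return GREEDY.matcher(url).find();
    }

    static boolean matchesSimple(String url) {
        return SIMPLE.matcher(url).find();
    }

    public static void main(String[] args) {
        // Made-up URLs: one with short path segments, one with a long segment.
        String[] urls = {
            "http://www.example.com/a/b/c",
            "http://www.example.com/averyveryverylongpathsegment/x",
        };
        for (String url : urls) {
            System.out.println(matchesGreedy(url) + " " + matchesSimple(url) + " " + url);
        }
    }
}
```

If the two patterns agree on a sample of real URLs, swapping the rule might be an alternative to removing it altogether.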
On a slightly different subject: is there a chance your patch 442 could be
added to the source? I am a big fan of it :-)
Thanks,
Julien
--
DigitalPebble Ltd
http://www.digitalpebble.com