Ned Rockson wrote:
(sorry if this is a repost, I'm not sure if it sent last time).

I have a very strange, reproducible bug that shows up when running
fetch across any number of documents >10000.  I'm running 47 map tasks
and 47 reduce tasks on 24 nodes.  The map phase finishes fine, and so
does the majority of the reduce phase; however, there are always two
segments that perpetually hang in the reduce > reduce phase.  What
happens is the reducer gets to 85.xx% and then stops responding.  Once
10 minutes go by, a new worker starts the task, gets to the same
85.xx% (+/- .1%) and hangs.  The other consistent part is that it's
always segment 2 and segment 5 (out of 47 segments).

I figured I could fix it by simply copying data from a different
segment in and continuing on the next iteration, but lo and behold,
the exact same problem happens in segment 2 and segment 5.

I assume it's not an IO problem because all of the nodes involved in
these segments finish other reduce tasks in the same iteration with no
problems.  Furthermore, I have seen this happen persistently over many
recent iterations.  My last iteration pulled down roughly 400,000
documents and I saw the same behavior.

Does anyone have any suggestions?


Yes. Most likely this is a problem with urlfilter-regex getting stuck on an abnormal URL (such as an extremely long URL, or a URL that contains control characters).

Please check in the JobTracker UI which task is stuck, and on which machine it's executing. Log in to that machine, identify the pid of this task process, and then generate a thread dump (using 'kill -SIGQUIT <pid>', which does NOT quit the process). If the thread dump shows some threads stuck in regex code, then it's likely that this is the problem.
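
If the thread dump is long, the check for regex frames can be automated. The following is only a minimal sketch, assuming the dump has been saved as text and that each thread section starts with a line beginning with a double quote (the usual HotSpot layout). The class name RegexStackScan and the sample dump are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class RegexStackScan {

    /**
     * Returns the header line of every thread whose stack contains at
     * least one frame in the java.util.regex package.
     */
    static List<String> threadsInRegex(String dump) {
        List<String> hits = new ArrayList<>();
        String currentThread = null;   // header of the thread being read
        boolean recorded = false;      // true once currentThread is in hits
        for (String line : dump.split("\\r?\\n")) {
            String t = line.trim();
            if (t.startsWith("\"")) {
                // Assumed layout: a thread header looks like "name" prio=...
                currentThread = t;
                recorded = false;
            } else if (currentThread != null && !recorded
                    && t.startsWith("at java.util.regex.")) {
                hits.add(currentThread);
                recorded = true;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical two-thread dump, shortened for the example.
        String sample =
              "\"reduce-task\" prio=5 tid=0x1 runnable\n"
            + "\tat java.util.regex.Pattern$Curly.match(Pattern.java)\n"
            + "\tat java.util.regex.Matcher.find(Matcher.java)\n"
            + "\n"
            + "\"idle\" prio=5 tid=0x2 waiting\n"
            + "\tat java.lang.Object.wait(Native Method)\n";
        for (String header : threadsInRegex(sample)) {
            System.out.println("in regex code: " + header);
        }
    }
}
```

A thread that shows up here in several dumps taken a minute or so apart is a good candidate for the stuck one.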

The solution is to avoid urlfilter-regex, or to change the order of urlfilters and put simpler filters in front of urlfilter-regex, in the hope that they will eliminate abnormal URLs before they are passed to urlfilter-regex.
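
To illustrate what such a "simpler filter" might check, here is a rough standalone sketch. It is not an actual Nutch plugin: the class name SanityUrlFilter and the 2048-character limit are assumptions chosen for the example. It only borrows the convention that a filter returns the URL to accept it and null to reject it. A real plugin would have to be registered with Nutch and placed ahead of urlfilter-regex in the filter order, which is not shown here.

```java
class SanityUrlFilter {

    // Hypothetical cut-off; tune it to what the crawl actually needs.
    static final int MAX_URL_LENGTH = 2048;

    /**
     * Returns the URL unchanged if it passes the cheap checks,
     * or null if it should be rejected before any regex sees it.
     */
    static String filter(String url) {
        if (url == null || url.length() > MAX_URL_LENGTH) {
            return null;               // missing or extremely long URL
        }
        for (int i = 0; i < url.length(); i++) {
            if (Character.isISOControl(url.charAt(i))) {
                return null;           // contains a control character
            }
        }
        return url;
    }

    public static void main(String[] args) {
        // Build a URL that is just over the length limit.
        StringBuilder longUrl = new StringBuilder("http://example.com/");
        while (longUrl.length() <= MAX_URL_LENGTH) {
            longUrl.append("a/");
        }
        String[] candidates = {
            "http://example.com/page.html",
            "http://example.com/a\tb",
            longUrl.toString()
        };
        for (String c : candidates) {
            String verdict = (filter(c) == null) ? "rejected" : "accepted";
            System.out.println(verdict + " (length " + c.length() + ")");
        }
    }
}
```

Both checks run in time linear in the URL length, so they cannot hang the way a backtracking regex can.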


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
