I have a very strange, reproducible bug that shows up when running
fetch on more than 10,000 documents.  I'm running 47 map tasks
and 47 reduce tasks on 24 nodes.  The map phase finishes fine, and so
does the majority of the reduce phase; however, there are always two
segments that perpetually hang in the reduce > reduce phase.  What
happens is the reducer gets to 85.xx% and then stops responding.  Once
10 minutes go by, a new worker starts the task, gets to the same
85.xx% (+/- 0.1%), and hangs.  The other consistent part is that it's
always segment 2 and segment 5 (out of 47 segments).

I figured I could work around it by simply copying in data from a
different segment and continuing with the next iteration, but lo and
behold, the exact same problem happens in segment 2 and segment 5.

I assume it's not an I/O problem, because all of the nodes involved in
these segments finish other reduce tasks in the same iteration with no
problems.  Furthermore, I have seen this happen persistently over many
recent iterations.  My last iteration pulled down roughly 400,000
documents and I saw the same behavior.

Does anyone have any suggestions?

-- 
Ned Rockson
Discovery Engine
795 Folsom Street
San Francisco, CA 94107
