(Sorry if this is a repost; I'm not sure whether it sent last time.) I have a very strange, reproducible bug that shows up when running fetch across any number of documents above 10,000. I'm running 47 map tasks and 47 reduce tasks on 24 nodes. The map phase finishes fine and so does the majority of the reduce phase; however, there are always two segments that hang indefinitely in the reduce > reduce phase. The reducer gets to 85.xx% and then stops responding. Once 10 minutes go by, a new worker starts the task, gets to the same 85.xx% (+/- 0.1%), and hangs. The other consistent part is that it's always segment 2 and segment 5 (out of 47 segments).
I figured I could fix it by simply copying in data from a different segment and continuing on the next iteration, but lo and behold, the exact same problem happens in segment 2 and segment 5. I assume it's not an I/O problem, because all of the nodes involved in these segments finish other reduce tasks in the same iteration with no problems. Furthermore, I have seen this happen persistently over the last several iterations. My last iteration pulled down roughly 400,000 documents and I saw the same behavior. Does anyone have any suggestions?

-- 
Ned Rockson
Discovery Engine
795 Folsom Street
San Francisco, CA 94107
