Gang,

I recently tried to index 40 segments at once. The job nearly completed (99.999214%?), but the last two reduce tasks stalled out at 50% complete.

I think this problem would probably be solved by increasing the ipc.client.timeout value again. However, this was the first time I've tried to index 40 segments at once, so a better solution is probably to index only 20 segments at a time and then merge more "indexes" directories at once (assuming the merge doesn't also choke) to create the final index directory on the slave. That's what I'm about to do right now.
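
For reference, bumping that timeout would just be a matter of overriding ipc.client.timeout in conf/nutch-site.xml, something like this (the 600000 ms value is only a placeholder for illustration, not a recommendation):

<property>
  <name>ipc.client.timeout</name>
  <!-- example only: IPC client timeout in milliseconds -->
  <value>600000</value>
</property>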

Details of my analysis:

Job % kept increasing until it finally got pegged at 99.999214%.

task_r_wv3zz8 was first assigned to tracker_48357 around 10:30am this morning:

060211 102748 parsing /home/crawler/tmp/local/taskTracker/task_r_wv3zz8/job.xml
060211 102748 parsing file:/home/crawler/nutch/conf/nutch-site.xml
060211 102748 task_r_juxr5i 0.0037815126% reduce > copy >
060211 102749 task_r_vvic99 3.1512606E-4% reduce > copy >
060211 102749 task_r_ohc2qm 2.3634454E-4% reduce > copy >
060211 102749 task_r_df5hab 0.0058035715% reduce > copy >
060211 102749 task_r_juxr5i 0.0038602941% reduce > copy >
060211 102749 task_r_1y6bo3 0.0013655463% reduce > copy >
060211 102750 task_r_mpgxy6 0.0030462185% reduce > copy >
060211 102750 task_r_wv3zz8 Got 9520 map output locations.
060211 102750 task_r_wv3zz8 0.0% reduce > copy >

The task works for a while, but eventually it and the similarly troubled task_r_ohc2qm start getting tons of timeout messages (which are probably what the "Timed out." cells on the Job Details web UI page reflect):

060211 121848 Task task_r_wv3zz8 timed out.  Killing.

It's not the first task to have such a message, though (e.g., task_r_atwmnl).
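
In case it helps anyone following along, my mental model of these messages is simply "the TaskTracker kills any task that hasn't reported progress within the timeout window." Here's a rough, generic sketch of that idea (emphatically not the actual TaskTracker code, and the 10-minute window below is made up):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TaskTimeoutMonitor {
  private final long timeoutMs;
  // task id -> timestamp (ms) of the most recent progress report
  private final Map<String, Long> lastProgress = new HashMap<String, Long>();

  public TaskTimeoutMonitor(long timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  // Called each time a task reports progress (e.g. "0.0037815126% reduce > copy >").
  public synchronized void progressReported(String taskId, long nowMs) {
    lastProgress.put(taskId, nowMs);
  }

  // Called periodically; returns the ids of tasks that have been silent too long.
  public synchronized List<String> findTimedOutTasks(long nowMs) {
    List<String> timedOut = new ArrayList<String>();
    for (Map.Entry<String, Long> e : lastProgress.entrySet()) {
      if (nowMs - e.getValue() > timeoutMs) {
        timedOut.add(e.getKey());
      }
    }
    return timedOut;
  }

  public static void main(String[] args) {
    // 10-minute window is a made-up example, not the real default.
    TaskTimeoutMonitor monitor = new TaskTimeoutMonitor(10L * 60 * 1000);
    monitor.progressReported("task_r_wv3zz8", 0L);
    // 15 minutes later with no further reports, the task gets flagged:
    for (String id : monitor.findTimedOutTasks(15L * 60 * 1000)) {
      System.out.println("Task " + id + " timed out.  Killing.");
    }
  }
}

Whether the real code works exactly this way, I'm not sure, but it matches what I see in the log.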

After its last "timed out. Killing." entry, task_r_wv3zz8 suddenly starts working again:

060211 135903 task_r_wv3zz8 0.24821429% reduce > copy >

This goes on until task_r_wv3zz8 finally gets halfway done:

060211 142925 task_r_ohc2qm 0.30866596% reduce > append > /home/crawler/tmp/local/task_r_ohc2qm/task_m_38qfi8.out
060211 142925 task_r_mpgxy6 1.0% closing > reduce
060211 142925 task_r_of6d9i 1.0% closing > reduce
060211 142925 task_r_wv3zz8 0.49706268% reduce > append > /home/crawler/tmp/local/task_r_wv3zz8/task_m_1jrvzr.out
060211 142926 task_r_juxr5i 1.0% closing > reduce
060211 142926 task_r_vvic99 1.0% closing > reduce
060211 142926 task_r_1y6bo3 1.0% closing > reduce
060211 142926 task_r_df5hab 1.0% closing > reduce
060211 142926 task_r_ohc2qm 0.30866757% reduce > append > /home/crawler/tmp/local/task_r_ohc2qm/task_m_38qfi8.out
060211 142926 task_r_mpgxy6 1.0% closing > reduce
060211 142926 task_r_of6d9i 1.0% closing > reduce
060211 142927 task_r_wv3zz8 0.5% reduce > sort

It's interesting that both tasks (task_r_wv3zz8 and task_r_ohc2qm) seem to be appending to output files associated with map tasks (task_m_1jrvzr.out and task_m_38qfi8.out, respectively).

task_r_wv3zz8 never makes any progress after that.

task_r_ohc2qm eventually makes it to 50% as well:

060211 143839 task_r_ohc2qm 0.4956699% reduce > append > /home/crawler/tmp/local/task_r_ohc2qm/task_m_egamh4.out
060211 143839 task_r_juxr5i 1.0% closing > reduce
060211 143839 task_r_wv3zz8 0.5% reduce > sort
060211 143839 task_r_mpgxy6 1.0% closing > reduce
060211 143839 task_r_vvic99 1.0% closing > reduce
060211 143839 task_r_df5hab 1.0% closing > reduce
060211 143840 task_r_1y6bo3 1.0% closing > reduce
060211 143840 task_r_of6d9i 1.0% closing > reduce
060211 143840 task_r_ohc2qm 0.5% reduce > sort

As of right now, both tasks remain stuck at 50% in the TaskTracker log.

Interestingly, another task that also had timeouts in the log and then sat at 50% for what seemed like forever eventually started making progress:

060211 023928 task_r_atwmnl 0.5% reduce > sort
060211 023929 task_r_o044ev 0.2521309% reduce > append > /home/crawler/tmp/local/task_r_o044ev/task_m_amfzg7.out
060211 023929 task_r_tnlohx 1.0% closing > reduce
060211 023929 task_r_atwmnl  Client connection to 192.168.1.11:8009: starting
060211 023929 task_r_atwmnl 0.7500587% reduce > reduce

And it seems that what it was waiting for was a connection to port 8009 on m1.krugle.net (NDFS/NameNode).

Strangely, task_r_atwmnl didn't appear anywhere in the web UI.

- Schmed
--
------------------------
Chris Schneider
TransPac Software, Inc.
[EMAIL PROTECTED]
------------------------
