Gang,

I recently tried to index 40 segments at once. The job nearly completed (99.999214%?), but the last two reduce tasks stalled out at 50% complete.

I think this problem would probably be solved by increasing the ipc.client.timeout value again. However, this was the first time I've tried to index 40 segments at once, so a better solution is probably to index only 20 segments at a time and then merge more "indexes" directories at once (assuming the merge doesn't also choke) to create the final index directory on the slave. That's what I'm about to do right now.
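
For reference, bumping that timeout would just be a matter of overriding ipc.client.timeout in conf/nutch-site.xml, something like this (the 600000 ms value is only a placeholder for illustration, not a recommendation):

<property>
  <name>ipc.client.timeout</name>
  <!-- example only: IPC client timeout in milliseconds -->
  <value>600000</value>
</property>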

Details of my analysis:

Job % kept increasing until it finally got pegged at 99.999214%.

task_r_wv3zz8 was first assigned to tracker_48357 around 10:30am this morning:

060211 102748 parsing /home/crawler/tmp/local/taskTracker/task_r_wv3zz8/job.xml
060211 102748 parsing file:/home/crawler/nutch/conf/nutch-site.xml
060211 102748 task_r_juxr5i 0.0037815126% reduce > copy >
060211 102749 task_r_vvic99 3.1512606E-4% reduce > copy >
060211 102749 task_r_ohc2qm 2.3634454E-4% reduce > copy >
060211 102749 task_r_df5hab 0.0058035715% reduce > copy >
060211 102749 task_r_juxr5i 0.0038602941% reduce > copy >
060211 102749 task_r_1y6bo3 0.0013655463% reduce > copy >
060211 102750 task_r_mpgxy6 0.0030462185% reduce > copy >
060211 102750 task_r_wv3zz8 Got 9520 map output locations.
060211 102750 task_r_wv3zz8 0.0% reduce > copy >

The task works for a while, but eventually it and the similarly troubled task_r_ohc2qm start getting tons of timeout messages (which are probably what the "Timed out." cells on the Job Details web UI page reflect):

060211 121848 Task task_r_wv3zz8 timed out.  Killing.

It's not the first task to have such a message, though (e.g., task_r_atwmnl).
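
In case it helps anyone following along, my mental model of these messages is simply "the TaskTracker kills any task that hasn't reported progress within the timeout window." Here's a rough, generic sketch of that idea (emphatically not the actual TaskTracker code, and the 10-minute window below is made up):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TaskTimeoutMonitor {
  private final long timeoutMs;
  // task id -> timestamp (ms) of the most recent progress report
  private final Map<String, Long> lastProgress = new HashMap<String, Long>();

  public TaskTimeoutMonitor(long timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  // Called each time a task reports progress (e.g. "0.0037815126% reduce > copy >").
  public synchronized void progressReported(String taskId, long nowMs) {
    lastProgress.put(taskId, nowMs);
  }

  // Called periodically; returns the ids of tasks that have been silent too long.
  public synchronized List<String> findTimedOutTasks(long nowMs) {
    List<String> timedOut = new ArrayList<String>();
    for (Map.Entry<String, Long> e : lastProgress.entrySet()) {
      if (nowMs - e.getValue() > timeoutMs) {
        timedOut.add(e.getKey());
      }
    }
    return timedOut;
  }

  public static void main(String[] args) {
    // 10-minute window is a made-up example, not the real default.
    TaskTimeoutMonitor monitor = new TaskTimeoutMonitor(10L * 60 * 1000);
    monitor.progressReported("task_r_wv3zz8", 0L);
    // 15 minutes later with no further reports, the task gets flagged:
    for (String id : monitor.findTimedOutTasks(15L * 60 * 1000)) {
      System.out.println("Task " + id + " timed out.  Killing.");
    }
  }
}

Whether the real code works exactly this way, I'm not sure, but it matches what I see in the log.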

After its last "timed out. Killing." entry, task_r_wv3zz8 suddenly starts working again:

060211 135903 task_r_wv3zz8 0.24821429% reduce > copy >

This goes on until task_r_wv3zz8 finally gets halfway done:

060211 142925 task_r_ohc2qm 0.30866596% reduce > append > /home/crawler/tmp/local/task_r_ohc2qm/task_m_38qfi8.out
060211 142925 task_r_mpgxy6 1.0% closing > reduce
060211 142925 task_r_of6d9i 1.0% closing > reduce
060211 142925 task_r_wv3zz8 0.49706268% reduce > append > /home/crawler/tmp/local/task_r_wv3zz8/task_m_1jrvzr.out
060211 142926 task_r_juxr5i 1.0% closing > reduce
060211 142926 task_r_vvic99 1.0% closing > reduce
060211 142926 task_r_1y6bo3 1.0% closing > reduce
060211 142926 task_r_df5hab 1.0% closing > reduce
060211 142926 task_r_ohc2qm 0.30866757% reduce > append > /home/crawler/tmp/local/task_r_ohc2qm/task_m_38qfi8.out
060211 142926 task_r_mpgxy6 1.0% closing > reduce
060211 142926 task_r_of6d9i 1.0% closing > reduce
060211 142927 task_r_wv3zz8 0.5% reduce > sort

It's interesting that both tasks (task_r_wv3zz8 and task_r_ohc2qm) seem to be appending to output files associated with map tasks (task_m_1jrvzr.out and task_m_38qfi8.out, respectively).

task_r_wv3zz8 never makes any progress after that.

task_r_ohc2qm eventually makes it to 50% as well:

060211 143839 task_r_ohc2qm 0.4956699% reduce > append > /home/crawler/tmp/local/task_r_ohc2qm/task_m_egamh4.out
060211 143839 task_r_juxr5i 1.0% closing > reduce
060211 143839 task_r_wv3zz8 0.5% reduce > sort
060211 143839 task_r_mpgxy6 1.0% closing > reduce
060211 143839 task_r_vvic99 1.0% closing > reduce
060211 143839 task_r_df5hab 1.0% closing > reduce
060211 143840 task_r_1y6bo3 1.0% closing > reduce
060211 143840 task_r_of6d9i 1.0% closing > reduce
060211 143840 task_r_ohc2qm 0.5% reduce > sort

As of right now, both tasks remain stuck at 50% in the TaskTracker log.

Interestingly, another task that also had timeouts in the log and then sat at 50% for what seemed like forever eventually started making progress:

060211 023928 task_r_atwmnl 0.5% reduce > sort
060211 023929 task_r_o044ev 0.2521309% reduce > append > /home/crawler/tmp/local/task_r_o044ev/task_m_amfzg7.out
060211 023929 task_r_tnlohx 1.0% closing > reduce
060211 023929 task_r_atwmnl  Client connection to 192.168.1.11:8009: starting
060211 023929 task_r_atwmnl 0.7500587% reduce > reduce

And it seems that what it was waiting for was a connection to port 8009 on m1.krugle.net (NDFS/NameNode).

Strangely, task_r_atwmnl didn't appear anywhere in the web UI.

- Schmed
--
------------------------
Chris Schneider
TransPac Software, Inc.
[EMAIL PROTECTED]
------------------------
