Hi Bobby,

I suspected OOM problems. Just before I got your last message I made some config changes: First I discovered that I was setting a property called mapreduce.child.java.opts in my mapred-site.xml and apparently this property is deprecated. I edited it to set mapreduce.map.child.java.opts = -Xmx1024m and mapreduce.reduce.child.java.opts = -Xmx1024m. I also edited hadoop-env.sh so that HADOOP_NAMENODE_OPTS has -Xmx4g and HADOOP_DATANODE_OPTS has -Xmx1024m.
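For reference, the relevant bits now look roughly like this (typed from memory, so treat it as a sketch rather than something to paste in; the exact property names vary between Hadoop versions):

    # hadoop-env.sh -- bigger heap for the NameNode, 1 GB for each DataNode
    export HADOOP_NAMENODE_OPTS="-Xmx4g ${HADOOP_NAMENODE_OPTS}"
    export HADOOP_DATANODE_OPTS="-Xmx1024m ${HADOOP_DATANODE_OPTS}"

    # mapred-site.xml -- per-task child JVM heaps, shown here flattened:
    #   mapreduce.map.child.java.opts    = -Xmx1024m
    #   mapreduce.reduce.child.java.opts = -Xmx1024m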
Not very scientific to change these all at once but I'm in a hurry.

The job is now running 25% faster than before and log files are growing more slowly. No errors and just three failed tasks (all of which worked on the second try) after 3.5 hours.

Using top I can see that no java processes are swapping or going mad, but I can see that some have swapped in the past, so maybe the changes have made a difference. (The rough commands I've been using to watch for swapping and OOM-killer activity are at the very end of this mail, after the quoted thread.)

I'll keep checking over the weekend. Should know whether it's going to work by tomorrow a.m. (London time).

Thanks for your help. I'll do my best to report what I did if I resolve this problem. The Internet is full of unresolveds.

Royston

On 4 Jan 2013, at 15:16, Robert Evans <[email protected]> wrote:

> This really should be on the user list so I am moving it over there.
>
> It is probably something about the OS that is killing it. The only thing
> that I know of on stock Linux that would do this is the Out of Memory
> Killer. Do you have swap enabled on these boxes? You should check the
> OOM killer logs, and if that is the case reset the box.
>
> --Bobby
>
> On 1/4/13 9:02 AM, "Royston Sellman" <[email protected]>
> wrote:
>
>> Hi Bobby,
>>
>> Thanks for the response, Bobby.
>>
>> The tasktracker logs such as "hadoop-hdfs-tasktracker-hd-37-03.log"
>> contained the log messages included in our previous message. They seem to
>> show a series of successful map attempts with a few reduce attempts
>> interspersed; then, at a certain point, they show only a series of reduce
>> attempts that appear to be stuck at the same level of progress, before
>> outputting the 143 exit code and the interrupted-sleep message at the end.
>>
>> There is nothing in the tasktracker~.out files...
>>
>> The machines did not go down, but the affected TTs did not log anything
>> until I got up in the morning, saw the job had frozen, and ran stop-all.sh.
>> Then the stalled TTs logged the shutdown.
>>
>> The disks are not full (67% usage across 12 disks per worker).
>>
>> It seems that the 143 exit code indicates that an external process has
>> terminated our tasktracker JVM. Is this correct?
>>
>> If so, what would the likely suspects be that would terminate our
>> tasktrackers? Is it possible this is related to our operating system
>> (Scientific Linux 6) and PAM limits?
>>
>> We had already increased our hard limit on the number of open files for
>> the "hdfs" user (which launches the hdfs and mapred daemons) to 32768 in
>> the hope that this would solve the issue. Can you see anything wrong with
>> our security limits?
>>
>> [hdfs@hd-37-03 hdfs]$ ulimit -aH
>> core file size          (blocks, -c) 0
>> data seg size           (kbytes, -d) unlimited
>> scheduling priority             (-e) 0
>> file size               (blocks, -f) unlimited
>> pending signals                 (-i) 191988
>> max locked memory       (kbytes, -l) 64
>> max memory size         (kbytes, -m) unlimited
>> open files                      (-n) 32768
>> pipe size            (512 bytes, -p) 8
>> POSIX message queues     (bytes, -q) 819200
>> real-time priority              (-r) 0
>> stack size              (kbytes, -s) unlimited
>> cpu time               (seconds, -t) unlimited
>> max user processes              (-u) unlimited
>> virtual memory          (kbytes, -v) unlimited
>> file locks                      (-x) unlimited
>>
>> Thanks for your help.
>>
>> Royston
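(Side note, in case anyone hits the same thing: the open-files increase mentioned above was done through the PAM limits configuration, roughly as below. This is a sketch from memory; our exact file name and entries may differ slightly.)

    # /etc/security/limits.conf (or a drop-in file under /etc/security/limits.d/)
    # raise the open-files limit for the user that launches the hdfs/mapred daemons
    hdfs    soft    nofile    32768
    hdfs    hard    nofile    32768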
>> On 4 Jan 2013, at 14:34, Robert Evans <[email protected]> wrote:
>>
>>> Is there anything in the task tracker's logs? Did the machines go down?
>>> Are there full disks on those nodes?
>>>
>>> --Bobby
>>>
>>> On 1/4/13 5:52 AM, "Royston Sellman" <[email protected]>
>>> wrote:
>>>
>>>> I'm running a job over a 380 billion row, 20 TB dataset which is
>>>> computing sum(), max(), etc. The job runs fine at around 3 million rows
>>>> per second for several hours, then grinds to a halt as it loses one
>>>> after another of the tasktrackers. We see a healthy mix of successful
>>>> map and reduce attempts on the tasktracker...
>>>>
>>>> 2013-01-03 23:41:40,249 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041109_0 1.0%
>>>> 2013-01-03 23:41:40,256 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041105_0 1.0%
>>>> 2013-01-03 23:41:40,260 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041105_0 1.0%
>>>> 2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201301031813_0001_m_041105_0 is done.
>>>> 2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201301031813_0001_m_041105_0 was 111
>>>> 2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 8
>>>> 2013-01-03 23:41:40,374 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041106_0 0.9884119%
>>>> 2013-01-03 23:41:40,432 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201301031813_0001_m_2021872807 exited with exit code 0. Number of tasks it ran: 1
>>>> 2013-01-03 23:41:40,807 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041103_0 0.9884134%
>>>> 2013-01-03 23:41:43,190 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041101_0 1.0%
>>>> 2013-01-03 23:41:43,193 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041101_0 1.0%
>>>> 2013-01-03 23:41:43,194 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201301031813_0001_m_041101_0 is done.
>>>> 2013-01-03 23:41:43,194 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201301031813_0001_m_041101_0 was 111
>>>> 2013-01-03 23:41:43,194 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 9
>>>> 2013-01-03 23:41:43,303 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041109_0 1.0%
>>>> 2013-01-03 23:41:43,306 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041109_0 1.0%
>>>> 2013-01-03 23:41:43,307 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201301031813_0001_m_041109_0 is done.
>>>> 2013-01-03 23:41:43,307 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201301031813_0001_m_041109_0 was 111
>>>> 2013-01-03 23:41:43,307 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 10
>>>> 2013-01-03 23:41:43,361 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201301031813_0001_m_36690963 exited with exit code 0. Number of tasks it ran: 1
>>>> 2013-01-03 23:41:43,428 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041106_0 1.0%
>>>> 2013-01-03 23:41:43,432 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041106_0 1.0%
>>>> 2013-01-03 23:41:43,433 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201301031813_0001_m_041106_0 is done.
>>>> 2013-01-03 23:41:43,433 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201301031813_0001_m_041106_0 was 111
>>>> 2013-01-03 23:41:43,433 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 11
>>>> 2013-01-03 23:41:43,457 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201301031813_0001_m_-2095784622 exited with exit code 0. Number of tasks it ran: 1
>>>> 2013-01-03 23:41:43,595 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201301031813_0001_m_1190449426 exited with exit code 0. Number of tasks it ran: 1
>>>> 2013-01-03 23:41:43,862 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041103_0 1.0%
>>>> 2013-01-03 23:41:43,866 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041103_0 1.0%
>>>> 2013-01-03 23:41:43,867 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201301031813_0001_m_041103_0 is done.
>>>> 2013-01-03 23:41:43,867 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201301031813_0001_m_041103_0 was 111
>>>> 2013-01-03 23:41:43,867 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 12
>>>> 2013-01-03 23:41:44,021 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201301031813_0001_m_-505301168 exited with exit code 0. Number of tasks it ran: 1
>>>> 2013-01-03 23:41:45,539 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050235882% reduce > copy (40975 of 271884 at 0.00 MB/s) >
>>>>
>>>> Then it seems to get stuck on reduce attempts, before exiting with a SIGTERM (143) exit code...
>>>> 2013-01-03 23:50:49,642 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:50:55,678 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:51:01,717 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:51:04,755 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:51:10,796 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:51:16,831 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:51:19,870 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:51:25,911 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:51:31,953 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:51:34,987 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:51:41,023 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:51:47,063 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:51:50,102 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:51:56,143 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:52:02,179 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:52:05,213 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:52:11,254 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:52:17,295 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:52:20,334 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:52:26,375 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:52:32,411 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:52:35,445 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:52:41,486 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:52:44,526 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:52:50,567 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:52:56,608 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:52:59,648 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:53:05,689 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:53:11,730 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:53:14,764 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:53:20,804 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:53:26,844 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:53:29,883 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:53:35,924 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>> 2013-01-03 23:53:41,964 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
>>>>
>>>> 2013-01-03 23:53:42,616 INFO org.apache.hadoop.mapred.TaskTracker: Recieved ReinitTrackerAction from JobTracker
>>>> 2013-01-03 23:53:42,881 INFO org.apache.hadoop.util.ProcessTree: Killing process group6089 with signal TERM. Exit code 0
>>>> 2013-01-03 23:53:42,881 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 12
>>>> 2013-01-03 23:53:43,348 WARN org.apache.hadoop.mapred.DefaultTaskController: Exit code from task is : 143
>>>> 2013-01-03 23:53:43,348 INFO org.apache.hadoop.mapred.DefaultTaskController: Output from DefaultTaskController's launchTask follows:
>>>> 2013-01-03 23:53:43,348 INFO org.apache.hadoop.mapred.TaskController:
>>>> 2013-01-03 23:53:43,349 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201301031813_0001_r_-2077112828 exited with exit code 143. Number of tasks it ran: 0
>>>> 2013-01-03 23:54:28,484 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Cleanup...
>>>> java.lang.InterruptedException: sleep interrupted
>>>>         at java.lang.Thread.sleep(Native Method)
>>>>         at org.apache.hadoop.filecache.TrackerDistributedCacheManager$CleanupThread.run(TrackerDistributedCacheManager.java:947)
>>>>
>>>> Just around this time (actually a bit before) the JobTracker log reports the following:
>>>>
>>>> 2013-01-03 23:52:31,348 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker 'tracker_hd-37-03.dice.cluster:localhost/127.0.0.1:36005'
>>>> 2013-01-03 23:52:31,348 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201301031813_0001_m_000010_0: Lost task tracker: tracker_hd-37-03.dice.cluster:localhost/127.0.0.1:36005
>>>> 2013-01-03 23:52:31,349 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201301031813_0001_m_000039_0: Lost task tracker: tracker_hd-37-03.dice.cluster:localhost/127.0.0.1:36005
>>>> 2013-01-03 23:52:31,349 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201301031813_0001_m_000041_0: Lost task tracker: tracker_hd-37-03.dice.cluster:localhost/127.0.0.1:36005
>>>> 2013-01-03 23:52:31,349 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201301031813_0001_m_000089_0: Lost task tracker: tracker_hd-37-03.dice.cluster:localhost/127.0.0.1:36005
>>>> 2013-01-03 23:52:31,349 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201301031813_0001_m_000092_0: Lost task tracker: tracker_hd-37-03.dice.cluster:localhost/127.0.0.1:36005
>>>>
>>>> Etc. etc.
>>>>
>>>> Can anyone help with this?
>>>>
>>>> Royston
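P.S. The rough commands I have been using to watch for the OOM killer and swap activity (mentioned above) are below. Nothing clever, and the log locations are what I see on Scientific Linux 6; they may differ on other distros.

    # Has the kernel OOM killer fired? Look for "Out of memory" / "Killed process".
    dmesg | grep -i -E 'out of memory|killed process'
    grep -i oom /var/log/messages

    # Is swap configured, and how much is in use?
    free -m
    swapon -s

    # Watch the si/so (swap in/out) columns while the job runs;
    # sustained non-zero values mean the boxes are thrashing.
    vmstat 5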
