Hi,

Putting this thread back in the pool to leverage collective intelligence.

If you get the full command line of the java processes, it wouldn't be difficult to correlate the reduce task(s) with a particular job.
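For example (a sketch only -- the exact flags and arguments vary by Hadoop version), on 0.20 each child JVM normally carries its task attempt ID on the command line, and the attempt ID embeds the job ID:

    $ jps -lvm | grep Child
    12345 org.apache.hadoop.mapred.Child ... -Dhadoop.tasklog.taskid=attempt_201004060646_0057_r_000014_0 ...

    # or, equivalently:
    $ ps -ef | grep org.apache.hadoop.mapred.Child

Here the illustrative pid 12345 is running reduce attempt attempt_201004060646_0057_r_000014_0, which belongs to job_201004060646_0057 from "hadoop job -list".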
Cheers

On Sat, Apr 17, 2010 at 2:20 PM, Raghava Mutharaju <m.vijayaragh...@gmail.com> wrote:

Hello Ted,

Thank you for the suggestions :). I haven't come across any other serious issue before this one. In fact, the same MR job runs for a smaller input size, although a lot slower than we expected.

I will use jstack to get the stack traces. I had a question in this regard: how would I know which MR job (job id) is related to which java process (pid)? I can get a list of hadoop jobs with "hadoop job -list" and a list of java processes with "jps", but I couldn't determine how to connect these two lists.

Thank you again.

Regards,
Raghava.

On Fri, Apr 16, 2010 at 11:07 PM, Ted Yu <yuzhih...@gmail.com> wrote:

If you look at https://issues.apache.org/jira/secure/ManageAttachments.jspa?id=12408776, you can see that hdfs-127-branch20-redone-v2.txt (https://issues.apache.org/jira/secure/attachment/12431012/hdfs-127-branch20-redone-v2.txt) was the latest.

You need to download the source code corresponding to your version of hadoop, apply the patch and rebuild.

If you haven't experienced serious issues with hadoop in other scenarios, we should try to find the root cause of the current problem without the 127 patch.

My advice is to use jstack to find out what each thread is waiting for after the reducers get stuck. I would expect a deadlock in either your code or HDFS; I would think it is the former.

You can replace sensitive names in the stack traces and paste them if you cannot determine the deadlock.
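Roughly, that comes down to something like the following (a sketch only -- it assumes the 0.20.2 source tree with its ant build, and the pid of the stuck reducer's child JVM; adjust -p0/-p1 to match how the diff paths were generated):

    # apply the patch and rebuild
    $ cd hadoop-0.20.2                              # source tree matching the cluster's version
    $ patch -p0 < hdfs-127-branch20-redone-v2.txt
    $ ant jar                                       # 0.20 builds with ant; redeploy the rebuilt core jar

    # once a reducer is stuck, on the tasktracker node running it:
    $ jstack <pid-of-child-jvm> > reducer-stack.txt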
Cheers

On Fri, Apr 16, 2010 at 5:46 PM, Raghava Mutharaju <m.vijayaragh...@gmail.com> wrote:

Hello Ted,

Thank you for the reply. Will this change fix my issue? I ask because I again need to convince my admin to make this change.

We have a gateway to the cluster head, and we generally run our MR jobs from the gateway. Should this change be made to the hadoop installation on the gateway?

1) I am confused about which patch should be applied. There are 4 patches listed at https://issues.apache.org/jira/browse/HDFS-127.

2) How do we apply the patch? Should we change the lines of code specified and rebuild hadoop, or is there another way?

Thank you again.

Regards,
Raghava.

On Fri, Apr 16, 2010 at 6:42 PM, <yuzhih...@gmail.com> wrote:

That patch is very important. Please apply it.

Sent from my Verizon Wireless BlackBerry

------------------------------
From: Raghava Mutharaju <m.vijayaragh...@gmail.com>
Date: Fri, 16 Apr 2010 17:27:11 -0400
To: Ted Yu <yuzhih...@gmail.com>
Subject: Re: Reduce gets struck at 99%

Hi Ted,

It took some time to contact my department's admin (he was on leave) and ask him to make the ulimit changes effective on the cluster (just adding an entry in /etc/security/limits.conf was not sufficient, so it took some time to figure out). Now the ulimit is 32768. I ran the set of MR jobs and the result is the same --- it gets stuck at Reduce 99%. But this time there are no exceptions in the logs. I view the JobTracker logs through the Web UI; I checked "Running Jobs" as well as "Failed Jobs".

I haven't asked the admin to apply the patch https://issues.apache.org/jira/browse/HDFS-127 that you mentioned earlier. Is this important?

Do you have any suggestions?

Thank you.

Regards,
Raghava.

On Fri, Apr 9, 2010 at 3:35 PM, Ted Yu <yuzhih...@gmail.com> wrote:

For the user under whom you launch MR jobs.

On Fri, Apr 9, 2010 at 12:02 PM, Raghava Mutharaju <m.vijayaragh...@gmail.com> wrote:

Hi Ted,

Sorry to bug you again :) but I do not have an account on all the datanodes; I only have one on the machine from which I start the MR jobs. So is it required to increase the ulimit on all the nodes (in which case the admin may have to increase it for all the users)?

Regards,
Raghava.

On Fri, Apr 9, 2010 at 11:43 AM, Ted Yu <yuzhih...@gmail.com> wrote:

ulimit should be increased on all nodes.

The link I gave you lists several actions to take; I think they're not specifically for hbase. Also make sure the following is applied: https://issues.apache.org/jira/browse/HDFS-127

On Thu, Apr 8, 2010 at 10:13 PM, Raghava Mutharaju <m.vijayaragh...@gmail.com> wrote:

Hello Ted,

Should the increase in ulimit to 32768 be applied on all the datanodes (it's a 16-node cluster)? Is this related to HBase? I am not using HBase. Are the exceptions & delay (at Reduce 99%) due to this?

Regards,
Raghava.

On Fri, Apr 9, 2010 at 1:01 AM, Ted Yu <yuzhih...@gmail.com> wrote:

Your ulimit is low. Ask your admin to increase it to 32768.

See http://wiki.apache.org/hadoop/Hbase/Troubleshooting, item #6.
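On most Linux systems that amounts to entries along these lines in /etc/security/limits.conf on every node (a sketch; "hadoop" here is a placeholder for whichever user runs the daemons and the MR tasks). The new limit only applies to new login sessions, so the daemons need to be restarted afterwards for it to take effect:

    hadoop  soft  nofile  32768
    hadoop  hard  nofile  32768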
On Thu, Apr 8, 2010 at 9:46 PM, Raghava Mutharaju <m.vijayaragh...@gmail.com> wrote:

Hi Ted,

I am pasting the timestamps from the logs below.

Lease exception:

Task attempt: attempt_201004060646_0057_r_000014_0
Machine: /default-rack/nimbus15
Status: FAILED (progress 0.00%)
Start time: 8-Apr-2010 07:38:53
Shuffle finished: 8-Apr-2010 07:39:21 (27sec)
Sort finished: 8-Apr-2010 07:39:21 (0sec)
Finish time: 8-Apr-2010 09:54:33 (2hrs, 15mins, 39sec)

DFS Client exception:

Task attempt: attempt_201004060646_0057_r_000006_0
Machine: /default-rack/nimbus3.cs.wright.edu
Status: FAILED (progress 0.00%)
Start time: 8-Apr-2010 07:38:47
Shuffle finished: 8-Apr-2010 07:39:10 (23sec)
Sort finished: 8-Apr-2010 07:39:11 (0sec)
Finish time: 8-Apr-2010 08:51:33 (1hrs, 12mins, 46sec)

The file limit is set to 1024. I checked a couple of datanodes; I haven't checked the headnode though.

The number of currently open files under my username, on the system from which I started the MR jobs, is 346.

Thank you for your help :)

Regards,
Raghava.

On Fri, Apr 9, 2010 at 12:14 AM, Ted Yu <yuzhih...@gmail.com> wrote:

Can you give me the timestamps of the two exceptions? I want to see if they're related.

I saw DFSClient$DFSOutputStream.close() in the first stack trace.

On Thu, Apr 8, 2010 at 9:09 PM, Ted Yu <yuzhih...@gmail.com> wrote:

Just to double-check that it's not a file-limits issue, could you run the following on each of the hosts:

$ ulimit -a
$ lsof | wc -l

The first command will show you (among other things) the file limit; it should be above the default 1024. The second will tell you how many files are currently open...

On Thu, Apr 8, 2010 at 7:40 PM, Raghava Mutharaju <m.vijayaragh...@gmail.com> wrote:

Hi Ted,

Thank you for all the suggestions. I went through the job tracker logs and I have attached the exceptions found in the logs. I found two exceptions:

1) org.apache.hadoop.ipc.RemoteException: java.io.IOException: Could not complete write to file (DFS Client)

2) org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/raghava/MR_EL/output/_temporary/_attempt_201004060646_0057_r_000014_0/part-r-00014 File does not exist. Holder DFSClient_attempt_201004060646_0057_r_000014_0 does not have any open files.

The exception occurs at the point of writing out <K,V> pairs in the reducer, and it occurs only in certain task attempts. I am not using any custom output format or record writers, but I do use a custom input reader.

What could have gone wrong here?

Thank you.

Regards,
Raghava.

On Thu, Apr 8, 2010 at 5:51 PM, Ted Yu <yuzhih...@gmail.com> wrote:

Raghava:

Are you able to share the last segment of the reducer log? You can get it from the web UI:

http://snv-it-lin-012.pr.com:50060/tasklog?taskid=attempt_201003221148_1211_r_000003_0&start=-8193

Adding more logging in your reducer task would help pinpoint where the issue is. Also look in the job tracker log.

Cheers

On Thu, Apr 8, 2010 at 2:46 PM, Raghava Mutharaju <m.vijayaragh...@gmail.com> wrote:

Hi Ted,

Thank you for the suggestion. I enabled it using the Configuration class, because I cannot change the hadoop-site.xml file (I am not an admin). The situation is still the same --- it gets stuck at reduce 99% and does not move further.
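That is, something along these lines (a sketch using the 0.20 mapreduce API and the property names from Ted's mail below, not the actual job code):

    // imports assumed: org.apache.hadoop.conf.Configuration, org.apache.hadoop.mapreduce.Job
    Configuration conf = new Configuration();
    conf.setBoolean("mapred.map.tasks.speculative.execution", true);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);
    Job job = new Job(conf, "my-mr-job");   // placeholder name; the constructor throws IOException
    // ... set input/output formats and paths as usual, then job.waitForCompletion(true)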
Regards,
Raghava.

On Thu, Apr 8, 2010 at 4:40 PM, Ted Yu <yuzhih...@gmail.com> wrote:

You need to turn it on yourself (hadoop-site.xml):

<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>true</value>
</property>

<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>true</value>
</property>

On Thu, Apr 8, 2010 at 1:14 PM, Raghava Mutharaju <m.vijayaragh...@gmail.com> wrote:

Hi,

Thank you Eric, Prashant and Greg. Although the timeout problem was resolved, reduce is getting stuck at 99%. As of now, it has been stuck there for about 3 hrs. That is too high a wait time for my task. Do you guys see any reason for this?

Speculative execution is "on" by default, right? Or should I enable it?

Regards,
Raghava.

On Thu, Apr 8, 2010 at 3:15 PM, Gregory Lawrence <gr...@yahoo-inc.com> wrote:

Hi,

I have also experienced this problem. Have you tried speculative execution? Also, I have had jobs that took a long time for one mapper / reducer because of a record that was significantly larger than those contained in the other filesplits. Do you know if it always slows down for the same filesplit?

Regards,
Greg Lawrence

On 4/8/10 10:30 AM, "Raghava Mutharaju" <m.vijayaragh...@gmail.com> wrote:

Hello all,

I got the timeout error mentioned below -- after 600 seconds, the attempt was killed and deemed a failure. I searched around about this error, and one of the suggestions was to include "progress" statements in the reducer -- it might be taking longer than 600 seconds and so is timing out. I added calls to context.progress() and context.setStatus(str) in the reducer. Now it works fine -- there are no timeout errors.
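The calls sit inside the reduce loop, roughly like this (a sketch of the pattern with placeholder key/value types, not the actual reducer code):

    // imports assumed: java.io.IOException, org.apache.hadoop.io.Text, org.apache.hadoop.mapreduce.Reducer
    public static class MyReducer extends Reducer<Text, Text, Text, Text> {
      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        long seen = 0;
        for (Text value : values) {
          // ... the actual per-value work goes here ...
          context.write(key, value);
          if (++seen % 10000 == 0) {
            context.setStatus("processed " + seen + " values for key " + key);
            context.progress();   // tell the framework the task is still alive
          }
        }
      }
    }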
But, for a few jobs, it takes an awfully long time to move from "Map 100%, Reduce 99%" to Reduce 100%. For some jobs it is 15 minutes, and for some it was more than an hour. The reduce code is not complex -- a 2-level loop and a couple of if-else blocks. The input size is also not huge; the job that gets stuck for an hour at reduce 99% takes in 130 files. Some of them are 1-3 MB in size and a couple of them are 16 MB.

Has anyone encountered this problem before? Any pointers? I use Hadoop 0.20.2 on a linux cluster of 16 nodes.

Thank you.

Regards,
Raghava.

On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju <m.vijayaragh...@gmail.com> wrote:

Hi all,

I am running a series of jobs one after another. While executing the 4th job, the job fails. It fails in the reducer --- the progress percentage would be map 100%, reduce 99%. It gives out the following message:

10/04/01 01:04:15 INFO mapred.JobClient: Task Id : attempt_201003240138_0110_r_000018_1, Status : FAILED
Task attempt_201003240138_0110_r_000018_1 failed to report status for 602 seconds. Killing!

It makes several attempts to execute it again but fails with a similar message. I couldn't get anything from this error message and wanted to look at the logs (located in the default dir of ${HADOOP_HOME}/logs). But I don't find any files which match the timestamp of the job. Also, I did not find history and userlogs in the logs folder. Should I look at some other place for the logs? What could be the possible causes for the above error?

I am using Hadoop 0.20.2 and I am running it on a cluster with 16 nodes.

Thank you.

Regards,
Raghava.