After all the jobs fail I can't run anything. Once I restart the cluster I am able to run other jobs with no problems; hadoop fs and other I/O-intensive jobs run just fine.
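(For reference, the sanity checks John suggests in his message below would look roughly like this; the local file, HDFS paths, and the input/output directories are placeholders, and HADOOP_HOME is assumed to point at the install:)

    # HDFS round trip: write a small file, list it, and read it back
    hadoop fs -put /tmp/sample.txt /user/hadoop/sample.txt
    hadoop fs -ls /user/hadoop
    hadoop fs -get /user/hadoop/sample.txt /tmp/sample.copy.txt

    # MapReduce smoke test with the bundled wordcount example
    hadoop jar $HADOOP_HOME/hadoop-*examples*.jar wordcount input output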
On Fri, Apr 27, 2012 at 3:12 PM, John George <john...@yahoo-inc.com> wrote:
> Can you run a regular 'hadoop fs' (put, ls, or get) command?
> If yes, how about a wordcount example?
> '<path>/hadoop jar <path>hadoop-*examples*.jar wordcount input output'
>
> -----Original Message-----
> From: Mohit Anchlia <mohitanch...@gmail.com>
> Reply-To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
> Date: Fri, 27 Apr 2012 14:36:49 -0700
> To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
> Subject: Re: DFSClient error
>
> >I even tried to reduce the number of jobs, but it didn't help. This is
> >what I see:
> >
> >datanode logs:
> >
> >Initializing secure datanode resources
> >Successfully obtained privileged resources (streaming port =
> >ServerSocket[addr=/0.0.0.0,localport=50010] ) (http listener port =
> >sun.nio.ch.ServerSocketChannelImpl[/0.0.0.0:50075])
> >Starting regular datanode initialization
> >26/04/2012 17:06:51 9858 jsvc.exec error: Service exit with a return
> >value of 143
> >
> >userlogs:
> >
> >2012-04-26 19:35:22,801 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy:
> >Snappy native library is available
> >2012-04-26 19:35:22,801 INFO org.apache.hadoop.io.compress.snappy.LoadSnappy:
> >Snappy native library loaded
> >2012-04-26 19:35:22,808 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory:
> >Successfully loaded & initialized native-zlib library
> >2012-04-26 19:35:22,903 INFO org.apache.hadoop.hdfs.DFSClient: Failed to
> >connect to /125.18.62.197:50010, add to deadNodes and continue
> >java.io.EOFException
> >        at java.io.DataInputStream.readShort(DataInputStream.java:298)
> >        at org.apache.hadoop.hdfs.DFSClient$RemoteBlockReader.newBlockReader(DFSClient.java:1664)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.getBlockReader(DFSClient.java:2383)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:2056)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2170)
> >        at java.io.DataInputStream.read(DataInputStream.java:132)
> >        at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:97)
> >        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:87)
> >        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
> >        at java.io.InputStream.read(InputStream.java:85)
> >        at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:205)
> >        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:169)
> >        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:114)
> >        at org.apache.pig.builtin.PigStorage.getNext(PigStorage.java:109)
> >        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187)
> >        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:456)
> >        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
> >        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
> >        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
> >        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
> >        at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
> >        at java.security.AccessController.doPrivileged(Native Method)
> >        at javax.security.auth.Subject.doAs(Subject.java:396)
> >        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
> >        at org.apache.hadoop.mapred.Child.main(Child.java:264)
> >2012-04-26 19:35:22,906 INFO org.apache.hadoop.hdfs.DFSClient: Failed to
> >connect to /125.18.62.204:50010, add to deadNodes and continue
> >java.io.EOFException
> >
> >namenode logs:
> >
> >2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.JobTracker: Job
> >job_201204261140_0244 added successfully for user 'hadoop' to queue 'default'
> >2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.JobTracker:
> >Initializing job_201204261140_0244
> >2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.AuditLogger:
> >USER=hadoop IP=125.18.62.196 OPERATION=SUBMIT_JOB
> >TARGET=job_201204261140_0244 RESULT=SUCCESS
> >2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.JobInProgress:
> >Initializing job_201204261140_0244
> >2012-04-26 16:12:53,581 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
> >createBlockOutputStream 125.18.62.198:50010 java.io.IOException: Bad
> >connect ack with firstBadLink as 125.18.62.197:50010
> >2012-04-26 16:12:53,581 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
> >block blk_2499580289951080275_22499
> >2012-04-26 16:12:53,582 INFO org.apache.hadoop.hdfs.DFSClient: Excluding
> >datanode 125.18.62.197:50010
> >2012-04-26 16:12:53,594 INFO org.apache.hadoop.mapred.JobInProgress:
> >jobToken generated and stored with users keys in
> >/data/hadoop/mapreduce/job_201204261140_0244/jobToken
> >2012-04-26 16:12:53,598 INFO org.apache.hadoop.mapred.JobInProgress: Input
> >size for job job_201204261140_0244 = 73808305. Number of splits = 1
> >2012-04-26 16:12:53,598 INFO org.apache.hadoop.mapred.JobInProgress:
> >tip:task_201204261140_0244_m_000000 has split on
> >node:/default-rack/dsdb4.corp.intuit.net
> >2012-04-26 16:12:53,598 INFO org.apache.hadoop.mapred.JobInProgress:
> >tip:task_201204261140_0244_m_000000 has split on
> >node:/default-rack/dsdb5.corp.intuit.net
> >2012-04-26 16:12:53,598 INFO org.apache.hadoop.mapred.JobInProgress:
> >job_201204261140_0244 LOCALITY_WAIT_FACTOR=0.4
> >2012-04-26 16:12:53,598 INFO org.apache.hadoop.mapred.JobInProgress: Job
> >job_201204261140_0244 initialized successfully with 1 map tasks and 0
> >reduce tasks.
> >
> >On Fri, Apr 27, 2012 at 7:50 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
> >
> >> On Thu, Apr 26, 2012 at 10:24 PM, Harsh J <ha...@cloudera.com> wrote:
> >>
> >>> Is only the same IP printed in all such messages? Can you check the DN
> >>> log in that machine to see if it reports any form of issues?
> >>
> >> All IPs were logged with this message.
> >>
> >>> Also, did your jobs fail or keep going despite these hiccups? I notice
> >>> you're threading your clients though (?), but I can't tell if that may
> >>> cause this without further information.
> >>
> >> It started with this error message and slowly all the jobs died with
> >> "shortRead" errors.
> >> I am not sure about the threading. I am using a Pig script to read the
> >> .gz files.
> >>
> >>> On Fri, Apr 27, 2012 at 5:19 AM, Mohit Anchlia <mohitanch...@gmail.com>
> >>> wrote:
> >>> > I had 20 mappers in parallel reading 20 gz files, each file around
> >>> > 30-40MB of data, over 5 hadoop nodes, and then writing to the
> >>> > analytics database.
> >>> > Almost midway it started to get this error:
> >>> >
> >>> > 2012-04-26 16:13:53,723 [Thread-8] INFO org.apache.hadoop.hdfs.DFSClient -
> >>> > Exception in createBlockOutputStream 17.18.62.192:50010
> >>> > java.io.IOException: Bad connect ack with firstBadLink as
> >>> > 17.18.62.191:50010
> >>> >
> >>> > I am trying to look at the logs, but they don't say much. What could
> >>> > be the reason? We are on a fairly closed, reliable network and all
> >>> > the machines are up.
> >>>
> >>> --
> >>> Harsh J
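(For context, a minimal sketch of the kind of Pig load described above, where mappers read .gz files with PigStorage, could be run like this; the script name, HDFS paths, and delimiter are all hypothetical:)

    # Hypothetical sketch only; paths and names are made up for illustration.
    cat > read_gz.pig <<'EOF'
    -- PigStorage decompresses .gz input transparently, which is why
    -- DecompressorStream appears under PigStorage.getNext in the trace above
    raw = LOAD '/user/hadoop/events/*.gz' USING PigStorage('\t');
    STORE raw INTO '/user/hadoop/events-out' USING PigStorage('\t');
    EOF
    pig read_gz.pig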