Caught something today I missed before:

11/03/16 09:32:49 INFO hdfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink 10.120.41.105:50010
11/03/16 09:32:49 INFO hdfs.DFSClient: Abandoning block blk_-517003810449127046_10039793
11/03/16 09:32:49 INFO hdfs.DFSClient: Waiting to find target node: 10.120.41.103:50010
11/03/16 09:34:04 INFO hdfs.DFSClient: Exception in createBlockOutputStream
java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.120.41.85:34323 remote=/10.120.41.105:50010]
11/03/16 09:34:04 INFO hdfs.DFSClient: Abandoning block blk_2153189599588075377_10039793
11/03/16 09:34:04 INFO hdfs.DFSClient: Waiting to find target node: 10.120.41.105:50010
11/03/16 09:34:55 INFO hdfs.DFSClient: Could not complete file /tmp/hadoop/mapred/system/job_201103160851_0014/job.jar retrying...
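
If I'm reading the 0.20 DFSClient right, the 69000 millis is the 60 second
socket timeout plus a 3 second extension per datanode in the 3-node write
pipeline (60000 + 3 * 3000), and 'Could not complete file ... retrying'
just means the NameNode hasn't yet seen the last block reach its minimum
replication, so the client loops on complete(). Both seem to point at the
datanodes being slow or overloaded rather than at the client.

Is it worth bumping the datanode transceiver limit and the timeouts while
we keep digging? Something like this in hdfs-site.xml on the datanodes
(values below are guesses for our load, not recommendations), plus checking
'ulimit -n' for the user running the datanodes:

  <property>
    <!-- max concurrent block readers/writers per datanode, default 256;
         0.20 really does spell it 'xcievers' -->
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
  <property>
    <!-- client <-> datanode socket read timeout, in ms (default 60000) -->
    <name>dfs.socket.timeout</name>
    <value>120000</value>
  </property>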



On Wed, Mar 16, 2011 at 9:00 AM, Chris Curtin <[email protected]> wrote:

> Thanks. I spent a lot of time looking at the logs, and there is nothing on
> the reducers until they start complaining about 'could not complete'.
>
> Found this in the jobtracker log file:
>
> 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_3829493505250917008_9959810
> java.io.IOException: Bad response 1 for block blk_3829493505250917008_9959810 from datanode 10.120.41.103:50010
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2454)
> 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_3829493505250917008_9959810 bad datanode[2] 10.120.41.103:50010
> 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_3829493505250917008_9959810 in pipeline 10.120.41.105:50010, 10.120.41.102:50010, 10.120.41.103:50010: bad datanode 10.120.41.103:50010
> 2011-03-16 02:38:53,133 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /var/hadoop/tmp/2_20110316_pmta_pipe_2_20_50351_2503122/_logs/history/hadnn01.atlis1_1299879680612_job_201103111641_0312_deliv_2_20110316_pmta_pipe*2_20110316_%5B%281%2F3%29+...QUEUED_T retrying...
>
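> To check whether one datanode dominates, grepping one of the failing block
> IDs across all the datanode logs shows which nodes report it (the log path
> below is our layout, yours may differ):
>
>   grep blk_3829493505250917008 /var/log/hadoop/hadoop-*-datanode-*.log
>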
> Looking at the logs from the various times this happens, the 'from
> datanode' in the first message varies across all of the data nodes (each
> fails roughly the same number of times), so I don't think one specific
> node is having problems.
> Any other ideas?
>
> Thanks,
>
> Chris
> On Sun, Mar 13, 2011 at 3:45 AM, icebergs <[email protected]> wrote:
>
>> You should check the bad reducers' logs carefully. There may be more
>> information about the failure there.
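>>
>> On 0.20 the per-attempt logs live on the tasktracker that ran the attempt,
>> for example (the exact path depends on your hadoop.log.dir setting):
>>
>>   less $HADOOP_LOG_DIR/userlogs/attempt_201103100812_0024_r_000003_0/syslog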
>>
>> 2011/3/10 Chris Curtin <[email protected]>
>>
>> > Hi,
>> >
>> > For the last couple of days we have been seeing tens of thousands of
>> > these errors in the logs:
>> >
>> > INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file
>> > /offline/working/3/aat/_temporary/_attempt_201103100812_0024_r_000003_0/4129371_172307245/part-00003
>> > retrying...
>> > When this is going on, the reducer in question is always the last
>> > reducer in a job.
>> >
>> > Sometimes the reducer recovers. Sometimes Hadoop kills that reducer,
>> > runs another, and it succeeds. Sometimes Hadoop kills the reducer and
>> > the new one also fails, so it gets killed too, and the cluster goes
>> > into a kill/launch/kill loop.
>> >
>> > At first we thought it was related to the size of the data being
>> > evaluated (4+ GB), but we've seen it several times today on < 100 MB
>> > of data.
>> >
>> > Searching here or online doesn't turn up much about what this error
>> > means or how to fix it.
>> >
>> > We are running Hadoop 0.20.2, r911707.
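>> >
>> > (Version info is from:
>> >
>> >   hadoop version
>> >
>> > which prints the release and the svn revision.)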
>> >
>> > Any suggestions?
>> >
>> >
>> > Thanks,
>> >
>> > Chris
>> >
>>
>
>
