0.20: OOME in HDFS client made an unrecoverable HDFS block
----------------------------------------------------------
Key: HDFS-1264
URL: https://issues.apache.org/jira/browse/HDFS-1264
Project: Hadoop HDFS
Issue Type: Bug
Components: data-node, hdfs client
Affects Versions: 0.20-append
Reporter: Todd Lipcon
Fix For: 0.20-append

Ran into a bad issue in testing overnight. One of the writers hit an OOME
(OutOfMemoryError) in the middle of writing a checksum chunk to the stream
inside a sync() call. It then retried recovery on each DN in the pipeline, but
every attempt failed, apparently because the writer's internal checksum buffer
had been left in a corrupted state. On the DNs I see "Unexpected checksum
mismatch" errors after each recovery attempt.

When another client tried to recover the file using appendFile, it got the
"Partial CRC 3766269197 does not match value computed the last time file was
closed" error (and by then only one replica was left in targets). It therefore
failed to set up the append pipeline and ran into HDFS-1262.
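
To illustrate the kind of check behind the "Partial CRC ... does not match" error, here is a minimal sketch of how a stored checksum for a partial chunk can disagree with one recomputed from the bytes on disk. This is not code from the HDFS client or DataNode: the class and method names (PartialCrcSketch, crcOf, partialChunkMatches) are hypothetical, and the single flipped byte is only an assumed stand-in for whatever state the OOME actually left behind.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical illustration only; names and flow are assumptions, not HDFS code.
class PartialCrcSketch {

    /** Computes a CRC32 over the first len bytes of data. */
    static long crcOf(byte[] data, int len) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, len);
        return crc.getValue();
    }

    /** Returns true if storedCrc equals the CRC recomputed from the on-disk bytes. */
    static boolean partialChunkMatches(byte[] onDisk, int len, long storedCrc) {
        return crcOf(onDisk, len) == storedCrc;
    }

    public static void main(String[] args) {
        // Bytes of a partial (not yet full) checksum chunk as they landed on disk.
        byte[] onDisk = "partial chunk".getBytes(StandardCharsets.US_ASCII);

        // Healthy case: the checksum was computed over the same bytes that were written.
        long goodCrc = crcOf(onDisk, onDisk.length);
        System.out.println("healthy writer matches: "
                + partialChunkMatches(onDisk, onDisk.length, goodCrc));

        // Assumed failure case: the writer's buffer diverged from the data it sent,
        // so the stored checksum describes bytes that are not the ones on disk.
        byte[] divergedBuffer = onDisk.clone();
        divergedBuffer[0] ^= 0x01;
        long badCrc = crcOf(divergedBuffer, divergedBuffer.length);
        System.out.println("diverged writer matches: "
                + partialChunkMatches(onDisk, onDisk.length, badCrc));
    }
}
```

A later appender that has to resume from the partial chunk would need the first check to pass before extending the block. In the second case it has no way to tell whether the data or the checksum is the bad half, which is consistent with the append pipeline failing to set up.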

This was on 0.20-append, though it may happen on trunk as well.