On Wed, Nov 23, 2011 at 1:23 AM, Uma Maheswara Rao G <mahesw...@huawei.com> wrote:
> Yes, Todd, block after restart is small and genstamp also lesser.
> Here complete machine reboot happened. The boards are configured like, if it
> is not getting any CPU cycles for 480secs, it will reboot itself.
> kernel.hung_task_timeout_secs = 480 sec.

So it sounds like the following happened:
- while writing the file, the pipeline got reduced down to 1 node due to
  timeouts from the other two
- soon thereafter (before more replicas were made), that last replica
  kernel-panicked without syncing the data
- on reboot, the filesystem lost some edits from its ext3 journal, and the
  block got moved back into the RBW directory, with truncated data
- HDFS did "the right thing" - at least what the algorithms say it should
  do, because it had gotten a commitment for a later replica

If you have a build which includes HDFS-1539, you could consider setting
dfs.datanode.synconclose to true, which would have prevented this problem.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera
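
P.S. For reference, a minimal sketch of what the dfs.datanode.synconclose setting mentioned above might look like in hdfs-site.xml on each DataNode. This assumes your build includes HDFS-1539; the file location and surrounding properties depend on your deployment, and the DataNodes would typically need a restart to pick it up.

```xml
<!-- hdfs-site.xml on each DataNode (sketch; assumes a build with HDFS-1539) -->
<configuration>
  <property>
    <name>dfs.datanode.synconclose</name>
    <!-- Ask the DataNode to sync block files to disk when a block is closed. -->
    <value>true</value>
  </property>
</configuration>
```

Expect some write-latency cost when blocks are finalized, since each close then waits on the disk sync.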