[
https://issues.apache.org/jira/browse/HDFS-5728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879179#comment-13879179
]
Kihwal Lee commented on HDFS-5728:
----------------------------------
The approach seems okay; it is actually what I did manually to recover. The new
test case seems adequate.
There are a few unnecessary lines of code in the patch, though.
{code}
+ // truncate blockFile
+ blockRAF.setLength(validFileLength);
+
+ // read last chunk
+ blockRAF.seek(lastChunkStartPos);
+ blockRAF.readFully(b, 0, lastChunkSize);
{code}
In the above, the last chunk of the block doesn't have to be read. In
{{truncateBlock()}}, which is called during {{recoverRbw()}}, that read is needed
in order to recompute the checksum of the (possibly partial) last chunk and write
it out to the meta file; simply truncating the meta file would cause a checksum
mismatch if the new block size doesn't align with the chunk size. In this jira it
is not necessary, since the meta file is not truncated.
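For context, here is a minimal sketch of what {{truncateBlock()}} has to do and why the read matters there. It is simplified and uses plain {{java.util.zip.CRC32}} and {{RandomAccessFile}} instead of the actual HDFS classes; the class/method names, parameters and meta-file layout are assumptions for illustration, not the real code.
{code}
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.zip.CRC32;

class TruncateBlockSketch {
  // After the block file is cut back to newLength, the last (now partial) chunk
  // no longer matches its stored CRC. So the chunk is re-read from disk, its
  // checksum is recomputed, and the meta file is truncated and patched to match.
  static void truncateBlockAndMeta(File blockFile, File metaFile, long newLength,
      int bytesPerChecksum, long metaHeaderSize) throws IOException {
    final int checksumSize = 4;  // assuming 4-byte CRC32 entries
    long lastChunkStart = (newLength / bytesPerChecksum) * bytesPerChecksum;
    int lastChunkSize = (int) (newLength - lastChunkStart);
    byte[] b = new byte[lastChunkSize];

    try (RandomAccessFile blockRAF = new RandomAccessFile(blockFile, "rw")) {
      blockRAF.setLength(newLength);              // truncate the block file
      if (lastChunkSize > 0) {
        blockRAF.seek(lastChunkStart);            // re-read the partial last chunk
        blockRAF.readFully(b, 0, lastChunkSize);
      }
    }

    long numChecksums = lastChunkStart / bytesPerChecksum
        + (lastChunkSize > 0 ? 1 : 0);
    long newMetaLength = metaHeaderSize + numChecksums * checksumSize;

    try (RandomAccessFile metaRAF = new RandomAccessFile(metaFile, "rw")) {
      metaRAF.setLength(newMetaLength);           // truncate the meta file
      if (lastChunkSize > 0) {
        CRC32 crc = new CRC32();                  // recompute the partial chunk's CRC
        crc.update(b, 0, lastChunkSize);
        metaRAF.seek(newMetaLength - checksumSize);
        metaRAF.writeInt((int) crc.getValue());   // rewrite its checksum entry
      }
    }
  }
}
{code}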
It made me think about the case where the block file is smaller than expected.
With the current code, 0 will be returned as the size. Instead, we could truncate
the meta file when the block file length is non-zero, roughly as sketched below.
But this case should be rare, since a block file is written before the
corresponding meta file.
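A rough sketch of that idea, assuming the chunk size, checksum size and header size have already been read from the meta file header (the class/method names and signature are made up for illustration, and the real validation code's handling of a partial last chunk is omitted):
{code}
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

class ShortBlockFileSketch {
  // If the block file is shorter than what the meta file's checksum count
  // implies, drop the extra checksum entries instead of reporting 0. Only whole
  // chunks are kept, so no checksum has to be recomputed here.
  static long validLengthOrTruncateMeta(File blockFile, File metaFile,
      int bytesPerChecksum, int checksumSize, long headerSize) throws IOException {
    long blockLen = blockFile.length();
    long numChecksums = (metaFile.length() - headerSize) / checksumSize;
    long metaImpliedLen = numChecksums * bytesPerChecksum; // last chunk treated as full

    if (blockLen == 0) {
      return 0;                       // nothing usable on disk
    }
    if (blockLen < metaImpliedLen) {
      long keepChunks = blockLen / bytesPerChecksum;
      try (RandomAccessFile metaRAF = new RandomAccessFile(metaFile, "rw")) {
        metaRAF.setLength(headerSize + keepChunks * checksumSize);
      }
      return keepChunks * bytesPerChecksum;
    }
    // Block file covers at least as many bytes as the checksums do; this is the
    // case handled by the patch (block file longer than the checksummed length).
    return metaImpliedLen;
  }
}
{code}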
> [Diskfull] Block recovery will fail if the metafile not having crc for all
> chunks of the block
> ----------------------------------------------------------------------------------------------
>
> Key: HDFS-5728
> URL: https://issues.apache.org/jira/browse/HDFS-5728
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 0.23.10, 2.2.0
> Reporter: Vinay
> Assignee: Vinay
> Priority: Critical
> Attachments: HDFS-5728.patch, HDFS-5728.patch
>
>
> 1. A client (regionserver) has opened a stream to write its WAL to HDFS. This
> is not a one-time upload; data is written slowly.
> 2. One of the DataNodes ran out of disk space (other data filled up the disks).
> 3. Unfortunately the block was being written to only this datanode in the
> cluster, so the client write also failed.
> 4. After some time the disk was freed up and all processes were restarted.
> 5. Now the HMaster tries to recover the file by calling recoverLease.
> At this point recovery fails with a file length mismatch.
> When checked:
> Actual block file length: 62484480
> Calculated block length: 62455808
> This was because the metafile had CRCs for only 62455808 bytes, so 62455808
> was taken as the block size.
> No matter how many times it was retried, recovery kept failing.
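For reference, the reported numbers are consistent with the meta file simply missing the trailing checksum entries, assuming the default 512-byte chunk size and 4-byte CRC32 entries (those defaults are an assumption, not stated in the report):
{code}
class LengthMismatchCheck {
  public static void main(String[] args) {
    final int bytesPerChecksum = 512;   // assumed default chunk size
    long checksummedLen = 62455808L;    // length implied by the meta file's CRCs
    long blockFileLen   = 62484480L;    // actual block file length on disk
    System.out.println(checksummedLen / bytesPerChecksum); // 121984 chunks with CRCs
    System.out.println(blockFileLen / bytesPerChecksum);   // 122040 chunks on disk
    // 56 trailing chunks (28672 bytes) have data but no checksum entries, so
    // recovery keeps computing 62455808 as the block length and failing.
    System.out.println((blockFileLen - checksummedLen) / bytesPerChecksum);
  }
}
{code}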