[ https://issues.apache.org/jira/browse/HDFS-8999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726445#comment-14726445 ]

Konstantin Shvachko commented on HDFS-8999:
-------------------------------------------

The current logic, as I remember it from when it was designed way back:
- Only the client can be trusted about the length of the block, because it 
knows how many bytes it pushed to the DNs.
- DataNodes do not determine the length of the block, only the length of the 
replica in their possession. This is because DNs do not sync replica data to 
disk, and because irresponsible users, untrusted scripts, or bugs in the local 
fs can damage replica files.
- When the client allocates a new block or closes the file, it also confirms 
the length of the last written block. That block goes into COMMITTED state, but 
can still remain under construction until the minimal number of replicas is 
reported by DNs.
- IBRs (incremental block reports, sent via blockReceivedAndDeleted()) confirm 
that replicas actually exist on DNs and have the expected length.
- The block needs to be COMMITTED and reported by the minimal number of DNs in 
order to go into COMPLETE state.
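The lifecycle in the list above can be pictured as a small state machine. What follows is a minimal Python sketch under stated assumptions: the class `LastBlock`, its methods, and the `min_replication` parameter are hypothetical names for illustration, not the NameNode's actual classes. It only models two rules. First, the length comes from the client. Second, DataNode reports merely confirm that replicas of that length exist.

```python
from enum import Enum


class BlockState(Enum):
    UNDER_CONSTRUCTION = "under_construction"
    COMMITTED = "committed"
    COMPLETE = "complete"


class LastBlock:
    """Hypothetical NameNode-side view of the last block of a file being written.

    Assumption: only the client's length is trusted; IBRs from DataNodes just
    report the length of the replica each DN holds.
    """

    def __init__(self, min_replication=1):
        self.min_replication = min_replication
        self.state = BlockState.UNDER_CONSTRUCTION
        self.committed_length = None  # set only by the client
        self.reports = {}  # datanode -> replica length reported via IBR

    def commit(self, client_length):
        # Client confirms the length when allocating the next block or closing.
        self.committed_length = client_length
        self.state = BlockState.COMMITTED
        self._maybe_complete()

    def block_received(self, datanode, replica_length):
        # IBR: a DN reports its finalized replica; it may arrive before commit.
        self.reports[datanode] = replica_length
        self._maybe_complete()

    def _matching_replicas(self):
        # Count only replicas whose length equals the client-confirmed length.
        return sum(1 for n in self.reports.values() if n == self.committed_length)

    def _maybe_complete(self):
        # COMPLETE requires both COMMITTED and enough matching replicas.
        if (self.state is BlockState.COMMITTED
                and self._matching_replicas() >= self.min_replication):
            self.state = BlockState.COMPLETE


blk = LastBlock(min_replication=2)
blk.block_received("dn1", 1024)  # IBR before commit: still UNDER_CONSTRUCTION
blk.commit(1024)                 # client confirms the length: COMMITTED
blk.block_received("dn2", 512)   # wrong length, not counted
blk.block_received("dn3", 1024)  # second matching replica
print(blk.state.name)            # COMPLETE
```

Keeping the reports in a dictionary lets IBRs that arrive before the client's commit be checked later against the committed length. This reflects the point that DNs alone cannot determine the block length.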

[The append design document|https://issues.apache.org/jira/secure/attachment/12445209/appendDesign3.pdf]
 is somewhat outdated by now, but still gives a good idea of how it was intended 
to work. I need to think more about this optimization proposal.
In general, I'd leave the data pipeline alone (it is a rather delicate subject) 
unless there is a clear bug.

> Namenode need not wait for {{blockReceived}} for the last block before 
> completing a file.
> -----------------------------------------------------------------------------------------
>
>                 Key: HDFS-8999
>                 URL: https://issues.apache.org/jira/browse/HDFS-8999
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Jitendra Nath Pandey
>
> This comes out of a discussion in HDFS-8763. Pasting [~jingzhao]'s comment 
> from the jira:
> ...whether we need to let NameNode wait for all the block_received msgs to 
> announce the replica is safe. Looking into the code, now we have
>    # NameNode knows the DataNodes involved when initially setting up the 
> writing pipeline
>    # If any DataNode fails during the writing, client bumps the GS and 
> finally reports all the DataNodes included in the new pipeline to NameNode 
> through the updatePipeline RPC.
>    # When the client receives the ack for the last packet of the block (and 
> before the client tries to close the file on the NameNode), the replica has 
> been finalized on all the DataNodes.
> In this case, when the NameNode receives the close request from the client, 
> it already knows the latest replicas for the block. Currently the 
> checkReplication call counts only the replicas for which the NN has already 
> received the block_received msg, but based on #2 and #3 above, it may be 
> safe to also count all the replicas in 
> BlockUnderConstructionFeature#replicas?
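The difference between the current counting and the quoted proposal can be sketched in a few lines. This is a minimal Python sketch. The function names and the `min_replication` parameter are hypothetical, and DataNodes are plain strings. `reported` stands for DNs whose block_received message has reached the NameNode. `pipeline` stands for the latest pipeline the NameNode learned at setup or via updatePipeline.

```python
from typing import Iterable, Set


def replicas_counted_current(reported: Set[str]) -> int:
    """Current behaviour as described: only DNs whose block_received
    message has already reached the NameNode are counted."""
    return len(reported)


def replicas_counted_proposed(reported: Set[str], pipeline: Iterable[str]) -> int:
    """Proposed behaviour: also count the DNs of the latest pipeline,
    on the premise that the client's ack for the last packet implies the
    replicas were finalized there. Duplicates are counted once."""
    return len(set(reported) | set(pipeline))


def can_complete(count: int, min_replication: int = 1) -> bool:
    # The block may complete once enough replicas are counted.
    return count >= min_replication


pipeline = ["dn1", "dn2", "dn3"]  # latest pipeline known to the NameNode
reported = {"dn1"}                # only dn1's block_received has arrived

print(can_complete(replicas_counted_current(reported), 2))             # False
print(can_complete(replicas_counted_proposed(reported, pipeline), 2))  # True
```

The proposed count lets the file close earlier, but it trusts pipeline membership and the client's ack rather than a DataNode's own report of the replica it holds. That is the trade-off the comment above cautions about.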



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
