[ https://issues.apache.org/jira/browse/HDFS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152094#comment-14152094 ]
Colin Patrick McCabe commented on HDFS-3107:
--------------------------------------------

Thanks for waiting. I'm checking out the design doc.

{code}
In proposed approach truncate is performed only on a closed file. If the file is opened for write an attempt to truncate fails.
{code}

Just a style change, but maybe "Truncate cannot be performed on a file which is currently open for writing" would be clearer.

{code}
Conceptually, truncate removes all full blocks of the file and then starts a recovery process for the last block if it is not fully truncated. The truncate recovery is similar to standard HDFS lease recovery procedure. That is, NameNode sends a DatanodeCommand to one of the DataNodes containing block replicas. The primary DataNode synchronizes the new length among the replicas, and then confirms it to the NameNode by sending commitBlockSynchronization() message, which completes the truncate. Until the truncate recovery is complete the file is assigned a lease, which revokes the ability for other clients to modify that file.
{code}

I think a diagram might help here. The impression I'm getting is that we have some "truncation point" like this:

{code}
             truncation point
                     |
                     V
+-----+-----+-----+-----+-----+-----+
|  A  |  B  |  C  |  D  |  E  |  F  |
+-----+-----+-----+-----+-----+-----+
{code}

In this case, blocks E and F would be invalidated by the NameNode, and block recovery would begin on block D? "Conceptually, truncate removes all full blocks of the file" seems to suggest we're removing all blocks, so it might be nice to rewrite this as "Truncate removes all full blocks after the truncation point."

{code}
Full blocks if any are deleted instantaneously. And if there is nothing more to truncate NameNode returns success to the client.
{code}

They're invalidated instantly, but not deleted instantly, right? Clients may still be reading from them on the various datanodes.

{code}
public boolean truncate(Path src, long newLength) throws IOException;

Truncate file src to the specified newLength.
Returns:
- true if the file have been truncated to the desired newLength and is immediately available to be reused for write operations such as append, or
- false if a background process of adjusting the length of the last block has been started, and clients should wait for it to complete before they can proceed with further file updates.
{code}

Hmm, do we really need the boolean here? It seems like the client could simply retry reopening the file until it no longer gets a {{RecoveryInProgressException}} (or lease exception, as the case may be); see the sketch below. The client will have to do this anyway most of the time, since most truncates don't fall on even block boundaries.

{code}
It should be noted that applications that cache data may still see old bytes of the file stored in the cache. It is advised for such applications to incorporate techniques, which would retire cache when the data is truncated.
{code}

One issue that I see here is that {{DFSInputStream}} users will potentially continue to see the old, longer length for a long time. {{DFSInputStream#locatedBlocks}} will continue to hold the block information it had prior to truncation. Eventually, whenever they try to read beyond the new length, they'll get read failures, since those blocks will actually have been unlinked. These will look like IOExceptions to the user. I don't know if there's a good way around this problem with the design proposed here.
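Going back to the boolean question: here is a rough sketch (mine, not from the design doc) of the retry loop I'm imagining. It assumes the proposed {{truncate(Path src, long newLength)}} ends up on {{FileSystem}}, and that the recovery-in-progress condition surfaces from {{append()}} as a {{RecoveryInProgressException}} (possibly wrapped in a {{RemoteException}}); the class name, path, and one-second sleep are just placeholders.

{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.protocol.RecoveryInProgressException;
import org.apache.hadoop.ipc.RemoteException;

public class TruncateThenAppend {
  /**
   * Truncate 'file' to 'newLength', then retry append() until recovery of
   * the last block (if any) has completed.  The boolean returned by the
   * proposed truncate() is ignored; the retry loop is needed anyway
   * whenever newLength does not fall on a block boundary.
   */
  static FSDataOutputStream truncateAndReopen(FileSystem fs, Path file,
      long newLength) throws IOException, InterruptedException {
    fs.truncate(file, newLength);   // proposed API; not in FileSystem today
    while (true) {
      try {
        return fs.append(file);     // succeeds once the last block is recovered
      } catch (RecoveryInProgressException e) {
        Thread.sleep(1000);         // block recovery still in progress
      } catch (RemoteException re) {
        // The NameNode-side exception may arrive wrapped; unwrap it first.
        IOException ioe =
            re.unwrapRemoteException(RecoveryInProgressException.class);
        if (!(ioe instanceof RecoveryInProgressException)) {
          throw ioe;
        }
        Thread.sleep(1000);         // block recovery still in progress
      }
    }
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    try (FSDataOutputStream out =
        truncateAndReopen(fs, new Path("/tmp/truncate-demo"), 1024L)) {
      out.writeBytes("new data written after the truncation point");
    }
  }
}
{code}

If clients need a loop like this anyway whenever the truncation point falls inside a block, returning a boolean from truncate() doesn't seem to buy them much.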
bq. \[truncate with snapshots\]

I don't think we should commit anything to trunk until we figure out how this integrates with snapshots. It just impacts the design too much.

When you start seriously thinking about snapshots, integrating this with block recovery (by adding {{BEING_TRUNCATED}}, etc.) does not look like a very good option. A better option would be simply to copy the partial block and have the snapshotted version reference the old block, and the new version reference the (shorter) copy. That corresponds to your approach #3, right? Truncate is presumably a rare operation, and doing the truncation in place for non-snapshotted files is an optimization we could do later.

The copy approach is also nice for {{DFSInputStream}}, since readers can continue reading from the old (longer) copy until they close their streams. If we truncated that copy directly, this would not work.

We could commit this to a branch, but I think we should hold off on committing to trunk until we figure out the snapshot story.

> HDFS truncate
> -------------
>
>                 Key: HDFS-3107
>                 URL: https://issues.apache.org/jira/browse/HDFS-3107
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>            Reporter: Lei Chang
>            Assignee: Plamen Jeliazkov
>         Attachments: HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch,
> HDFS-3107.patch, HDFS-3107.patch, HDFS_truncate.pdf,
> HDFS_truncate_semantics_Mar15.pdf, HDFS_truncate_semantics_Mar21.pdf,
> editsStored
>
>   Original Estimate: 1,344h
>  Remaining Estimate: 1,344h
>
> Systems with transaction support often need to undo changes made to the
> underlying storage when a transaction is aborted. Currently HDFS does not
> support truncate (a standard Posix operation), which is the reverse of
> append; this makes upper-layer applications use ugly workarounds (such as
> keeping track of the discarded byte range per file in a separate metadata
> store, and periodically running a vacuum process to rewrite compacted files)
> to overcome this limitation of HDFS.