[ https://issues.apache.org/jira/browse/HDFS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152094#comment-14152094 ]

Colin Patrick McCabe commented on HDFS-3107:
--------------------------------------------

Thanks for waiting.  I'm checking out the design doc.

{code}
In proposed approach truncate is performed only on a closed file. If the file
is opened for write an attempt to truncate fails.
{code}

Just a style change, but maybe "Truncate cannot be performed on a file which is 
currently open for writing" would be clearer.

{code}
Conceptually, truncate removes all full blocks of the file and then starts a
recovery process for the last block if it is not fully truncated. The truncate
recovery is similar to standard HDFS lease recovery procedure. That is,
NameNode sends a DatanodeCommand to one of the DataNodes containing block
replicas. The primary DataNode synchronizes the new length among the replicas,
and then confirms it to the NameNode by sending commitBlockSynchronization()
message, which completes the truncate. Until the truncate recovery is complete
the file is assigned a lease, which revokes the ability for other clients to
modify that file.
{code}

I think a diagram might help here.  The impression I'm getting is that we have 
some "truncation point" like this:
{code}
               truncation point
                    |
                    V 
+-----+-----+-----+-----+-----+-----+
| A   | B   | C   | D   | E   | F   |
+-----+-----+-----+-----+-----+-----+
{code}
In this case, blocks E and F would be invalidated by the NameNode, and block 
recovery would begin on block D?

"Conceptually, truncate removes all full blocks of the file" seems to suggest 
we're removing all blocks, so it might be nice to rewrite this as "Truncate 
removes all full blocks after the truncation point."
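
To make the boundary arithmetic concrete, here's a quick sketch (my own 
illustration, not code from the patch) of how blocks would be classified 
against a truncation point, assuming fixed-size full blocks:
{code}
// Illustration only: classify blocks relative to the truncation point.
// Real HDFS block lists are richer, but the boundary math is the same.
static void classify(long newLength, long blockSize, int numBlocks) {
  long fullBlocksKept = newLength / blockSize;  // blocks A..C in the diagram
  long remainder = newLength % blockSize;       // partial bytes left in D
  for (int i = 0; i < numBlocks; i++) {
    if (i < fullBlocksKept) {
      System.out.println("block " + i + ": kept intact");
    } else if (i == fullBlocksKept && remainder > 0) {
      System.out.println("block " + i + ": recover to " + remainder + " bytes");
    } else {
      System.out.println("block " + i + ": invalidated");  // blocks E and F
    }
  }
}
{code}
With the truncation point inside block D as drawn above, that yields A-C kept 
intact, D recovered to a shorter length, and E-F invalidated.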

{code}
 Full blocks if any are deleted instantaneously. And if there is nothing more 
to truncate NameNode returns success to the client.
{code}

They're invalidated instantly, but not deleted instantly, right?  Clients may 
still be reading from them on the various datanodes.

{code}
public boolean truncate(Path src, long newLength) throws IOException;

Truncate file src to the specified newLength.
Returns:
- true if the file has been truncated to the desired newLength and is
  immediately available to be reused for write operations such as append, or
- false if a background process of adjusting the length of the last block has
  been started, and clients should wait for it to complete before they can
  proceed with further file updates.
{code}

Hmm, do we really need the boolean here?  It seems like the client could simply 
try to reopen the file until it no longer got a 
{{RecoveryInProgressException}} (or a lease exception, as the case may be).  
The client will have to do this anyway most of the time, since most truncates 
don't fall on even block boundaries.
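
Something like the following client-side loop is what I have in mind (a sketch 
against the public FileSystem API; the one-second backoff is arbitrary):
{code}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: instead of branching on a boolean from truncate(), just poll
// append() until the recovery of the last block has finished.
static FSDataOutputStream appendWhenReady(FileSystem fs, Path path)
    throws InterruptedException {
  while (true) {
    try {
      return fs.append(path);  // fails while block recovery is in progress
    } catch (IOException e) {  // RecoveryInProgressException or lease error
      Thread.sleep(1000);      // arbitrary retry interval for the sketch
    }
  }
}
{code}
A real client would of course want to distinguish a recovery or lease 
exception from a genuine failure rather than retrying on any IOException.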

{code}
It should be noted that applications that cache data may still see old bytes
of the file stored in the cache. It is advised for such applications to
incorporate techniques, which would retire cache when the data is truncated.
{code}

One issue that I see here is that {{DFSInputStream}} users will potentially 
continue to see the old, longer length for a long time.  
{{DFSInputStream#locatedBlocks}} will continue to have the block information it 
had prior to truncation.  And eventually, whenever they try to read from that 
longer length, they'll get read failures since the blocks will actually be 
unlinked.  These will look like IOExceptions to the user.  I don't know if 
there's a good way around this problem with the design proposed here.
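
About the best an application could do under this design is to catch the 
failure and reopen, which refreshes the cached block locations and length.  A 
sketch, which obviously doesn't help streams that are already open:
{code}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: reopen after a concurrent truncate so the stream picks up the
// new block list and length, and clamp the read position to the new EOF.
static int readAfterReopen(FileSystem fs, Path path, long pos, byte[] buf)
    throws IOException {
  long newLen = fs.getFileStatus(path).getLen();  // post-truncate length
  if (pos >= newLen) {
    return -1;  // the requested range no longer exists
  }
  FSDataInputStream in = fs.open(path);
  try {
    in.seek(pos);
    return in.read(buf);
  } finally {
    in.close();
  }
}
{code}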

bq. \[truncate with snapshots\]

I don't think we should commit anything to trunk until we figure out how this 
integrates with snapshots.  It just impacts the design too much.  When you 
start seriously thinking about snapshots, integrating this with block recovery 
(by adding {{BEING_TRUNCATED}}, etc.) does not look like a very good option.  A 
better option would be simply to copy the partial block and have the 
snapshotted version reference the old block, and the new version reference the 
(shorter) copy.  That corresponds to your approach #3, right?  {{truncate}} is 
presumably a rare operation, and doing the truncation in-place for 
non-snapshotted files is an optimization we could do later.

The copy approach is also nice for {{DFSInputStream}}, since readers can 
continue reading from the old (longer) copy until the readers close.  If we 
truncated that copy directly, this would not work.
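
To spell out what I mean by the copy approach, here's a toy model (block ids 
and sizes are made up):
{code}
// Toy model of copy-on-truncate: the snapshot keeps referencing the
// original last block, while the live file points at a shorter copy,
// so in-flight readers of the old block are unaffected.
public class CopyOnTruncateSketch {
  static class BlockRef {
    final long blockId, numBytes;
    BlockRef(long blockId, long numBytes) {
      this.blockId = blockId;
      this.numBytes = numBytes;
    }
  }

  public static void main(String[] args) {
    BlockRef snapshotLast = new BlockRef(1004, 134217728L);  // original block
    BlockRef liveLast = new BlockRef(2001, 50331648L);       // shorter copy
    System.out.println("snapshot reads block " + snapshotLast.blockId
        + "; live file reads block " + liveLast.blockId);
  }
}
{code}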

We could commit this to a branch, but I think we should hold off on committing 
to trunk until we figure out the snapshot story.

> HDFS truncate
> -------------
>
>                 Key: HDFS-3107
>                 URL: https://issues.apache.org/jira/browse/HDFS-3107
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>            Reporter: Lei Chang
>            Assignee: Plamen Jeliazkov
>         Attachments: HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, 
> HDFS-3107.patch, HDFS-3107.patch, HDFS_truncate.pdf, 
> HDFS_truncate_semantics_Mar15.pdf, HDFS_truncate_semantics_Mar21.pdf, 
> editsStored
>
>   Original Estimate: 1,344h
>  Remaining Estimate: 1,344h
>
> Systems with transaction support often need to undo changes made to the 
> underlying storage when a transaction is aborted. Currently HDFS does not 
> support truncate (a standard Posix operation) which is a reverse operation of 
> append, which makes upper layer applications use ugly workarounds (such as 
> keeping track of the discarded byte range per file in a separate metadata 
> store, and periodically running a vacuum process to rewrite compacted files) 
> to overcome this limitation of HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
