[ https://issues.apache.org/jira/browse/HADOOP-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538657 ]

dhruba borthakur commented on HADOOP-1707:
------------------------------------------

I agree that the timeout issue does not have a very elegant solution. Here is 
a new proposal.

The Client
--------------
1. The Client uses a small pool of memory buffers per dfs-output stream: say, 
10 buffers of size 64K each.
2. A write to the output stream copies the user data into one of these 
buffers, if one is available. Otherwise the user-write blocks.
3. A separate thread (one per output stream) sends buffers that are full. Each 
buffer carries metadata: a sequence number (locally generated on the client), 
the length of the buffer, and its offset in the block.
4. Another thread (one per output stream) processes incoming responses. Each 
incoming response carries the sequence number of the buffer that the datanodes 
have processed; the client removes that buffer from its queue. (A sketch of 
this scheme follows the list.)
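
To make the client side concrete, here is a minimal Java sketch of the scheme 
above. It is illustrative only: Packet, sendPacket, and readAck are made-up 
names, not the actual DFSClient code, and the wire protocol is left abstract.

import java.nio.ByteBuffer;
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;

abstract class DfsOutputStreamSketch {
  static final int NUM_BUFFERS = 10;          // small per-stream pool
  static final int BUFFER_SIZE = 64 * 1024;   // 64K each

  static final class Packet {
    final long seqno;          // locally generated sequence number
    final long offsetInBlock;  // offset of this data within the block
    final ByteBuffer data;     // up to 64K of copied user data
    Packet(long seqno, long offsetInBlock, ByteBuffer data) {
      this.seqno = seqno; this.offsetInBlock = offsetInBlock; this.data = data;
    }
  }

  // Free pool: a user write blocks here when all buffers are in flight.
  private final BlockingQueue<ByteBuffer> freeBuffers =
      new ArrayBlockingQueue<>(NUM_BUFFERS);
  // Full buffers waiting for the sender thread.
  private final BlockingQueue<Packet> sendQueue =
      new ArrayBlockingQueue<>(NUM_BUFFERS);
  // Sent but unacknowledged packets, oldest first (acks arrive in order).
  private final Queue<Packet> ackQueue = new ConcurrentLinkedQueue<>();

  private long nextSeqno = 0;
  private long offsetInBlock = 0;

  DfsOutputStreamSketch() {
    for (int i = 0; i < NUM_BUFFERS; i++) {
      freeBuffers.add(ByteBuffer.allocate(BUFFER_SIZE));
    }
  }

  void start() {
    new Thread(this::senderLoop, "packet-sender").start();   // step 3
    new Thread(this::ackLoop, "response-processor").start(); // step 4
  }

  // Step 2: copy user data into a free buffer, blocking if none is free.
  // (Simplified: each write ships its buffer immediately; the real scheme
  // would only ship a buffer once it is full or the stream is flushed.)
  void write(byte[] b, int off, int len) throws InterruptedException {
    while (len > 0) {
      ByteBuffer buf = freeBuffers.take();  // blocks: pool exhausted
      int n = Math.min(len, buf.remaining());
      buf.put(b, off, n);
      buf.flip();
      sendQueue.put(new Packet(nextSeqno++, offsetInBlock, buf));
      offsetInBlock += n;
      off += n;
      len -= n;
    }
  }

  // Step 3: one thread per stream sends full buffers to the primary datanode.
  private void senderLoop() {
    try {
      while (true) {
        Packet p = sendQueue.take();
        ackQueue.add(p);   // keep it until the pipeline acknowledges it
        sendPacket(p);     // hypothetical: frame seqno/offset/length + data
      }
    } catch (InterruptedException ignored) { }
  }

  // Step 4: a second thread removes acknowledged buffers and recycles them.
  private void ackLoop() {
    try {
      while (true) {
        long acked = readAck();  // hypothetical: seqno the datanodes persisted
        Packet p = ackQueue.poll();
        if (p != null && p.seqno == acked) {
          p.data.clear();
          freeBuffers.put(p.data);  // reusable once persisted everywhere
        }
      }
    } catch (InterruptedException ignored) { }
  }

  // Network I/O deliberately left abstract in this sketch.
  abstract void sendPacket(Packet p);
  abstract long readAck() throws InterruptedException;
}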

The Primary Datanode
------------------------------
The primary datanode has two threads per stream. The first thread processes 
incoming packets from the client, writes them to the downstream datanode and 
writes them to local disk. The second thread processes responses from 
downstream datanodes and forwards them back to the client.

This means that the client gets back an ack only when the packet is persisted 
on all datanodes in the pipeline. In the future this can be changed so that 
the client gets an ack once the data is persisted on dfs.replication.min 
datanodes.

In case the primary datanode encounters an exception while writing to the 
downstream datanode, it declares the block as bad. It removes the immediate 
downstream datanode from the pipeline. It makes an RPC to the namenode to 
abandon the current blockId and *replace* it with a new one. It then 
establishes a new pipeline over the remaining datanodes using the new blockId, 
and copies all the data from the local temporary block file to the downstream 
datanodes under the new blockId. (A sketch of this recovery path follows.)
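
A rough Java sketch of this recovery path. Everything here is hypothetical 
(BlockId, NamenodeRpc, abandonAndReplaceBlock, Pipeline are made-up names, 
not the real namenode protocol); it only shows the order of the steps.

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

abstract class PipelineRecoverySketch {

  interface BlockId { }
  interface DatanodeInfo { }

  // Hypothetical namenode RPC: abandon the old block id, obtain a fresh one.
  interface NamenodeRpc {
    BlockId abandonAndReplaceBlock(BlockId oldBlock) throws IOException;
  }

  interface Pipeline {
    List<DatanodeInfo> datanodes();
    DatanodeInfo firstDownstream();
    void writeFully(File localBlockFile) throws IOException; // replay data
  }

  // Recovery path: invoked when a write to the immediate downstream
  // datanode throws and the current block replica is declared bad.
  Pipeline recover(Pipeline failed, BlockId oldBlock, File localTmpBlock)
      throws IOException {
    // 1. Drop the immediate downstream datanode from the pipeline.
    List<DatanodeInfo> remaining = new ArrayList<>(failed.datanodes());
    remaining.remove(failed.firstDownstream());

    // 2. RPC to the namenode: abandon the old blockId, get a new one.
    BlockId newBlock = namenode().abandonAndReplaceBlock(oldBlock);

    // 3. Establish a new pipeline over the remaining datanodes.
    Pipeline fresh = connect(remaining, newBlock);

    // 4. Re-send everything from the local temporary block file under
    //    the new blockId.
    fresh.writeFully(localTmpBlock);
    return fresh;
  }

  abstract NamenodeRpc namenode();
  abstract Pipeline connect(List<DatanodeInfo> nodes, BlockId block)
      throws IOException;
}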

The Secondary Datanodes
------------------------------------
The secondary datanode has two threads per stream. The first thread processes 
incoming packets from the upstream datanode, writes them to the downstream 
datanode, and writes them to local disk. The second thread processes responses 
from downstream datanodes and forwards them back to the upstream datanode.

Each secondary datanode sends its own response as well as forwarding the 
responses of all downstream datanodes. (A sketch of this relay follows.)
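
A minimal Java sketch of this per-stream relay, which is essentially the same 
for the primary and the secondary datanodes (the last datanode in the pipeline 
has no downstream and originates the acks itself). The stream names and wire 
framing are hypothetical, not the actual datanode protocol.

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

class DatanodeRelaySketch {
  private final DataInputStream fromUpstream;    // client or previous datanode
  private final DataOutputStream toDownstream;   // next datanode in pipeline
  private final DataInputStream fromDownstream;  // acks from the next datanode
  private final DataOutputStream toUpstream;     // acks back toward the client
  private final OutputStream localDisk;          // temporary local block file

  DatanodeRelaySketch(DataInputStream fromUpstream,
                      DataOutputStream toDownstream,
                      DataInputStream fromDownstream,
                      DataOutputStream toUpstream,
                      OutputStream localDisk) {
    this.fromUpstream = fromUpstream;
    this.toDownstream = toDownstream;
    this.fromDownstream = fromDownstream;
    this.toUpstream = toUpstream;
    this.localDisk = localDisk;
  }

  void start() {
    new Thread(this::receivePackets, "packet-receiver").start();
    new Thread(this::relayAcks, "ack-relay").start();
  }

  // Thread 1: read each packet, forward it downstream, write it to disk.
  private void receivePackets() {
    try {
      while (true) {
        long seqno = fromUpstream.readLong();     // hypothetical wire framing
        int len = fromUpstream.readInt();
        byte[] data = new byte[len];
        fromUpstream.readFully(data);
        toDownstream.writeLong(seqno);            // forward first, so the
        toDownstream.writeInt(len);               // downstream write overlaps
        toDownstream.write(data);                 // this node's disk write
        toDownstream.flush();
        localDisk.write(data);
      }
    } catch (IOException e) {
      // A failure here triggers the primary's recovery path described above.
    }
  }

  // Thread 2: read each downstream ack, add this node's own status, and
  // forward the combined response upstream.
  private void relayAcks() {
    try {
      while (true) {
        long ackedSeqno = fromDownstream.readLong();
        boolean downstreamOk = fromDownstream.readBoolean();
        toUpstream.writeLong(ackedSeqno);
        toUpstream.writeBoolean(downstreamOk); // plus local success, elided
        toUpstream.flush();
      }
    } catch (IOException e) {
      // The upstream node will observe the broken connection.
    }
  }
}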



> Remove the DFS Client disk-based cache
> --------------------------------------
>
>                 Key: HADOOP-1707
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1707
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>             Fix For: 0.16.0
>
>
> The DFS client currently uses a staging file on local disk to cache all 
> user-writes to a file. When the staging file accumulates 1 block worth of 
> data, its contents are flushed to an HDFS datanode. These operations occur 
> sequentially.
> A simple optimization of allowing the user to write to another staging file 
> while simultaneously uploading the contents of the first staging file to HDFS 
> will improve file-upload performance.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
