[ 
https://issues.apache.org/jira/browse/HDFS-13117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354893#comment-16354893
 ] 

Wei-Chiu Chuang commented on HDFS-13117:
----------------------------------------

Hi [~xuchuanyin], thanks for filing the jira.

Which client application are you describing here? Flume, for example, can 
append to an HDFS stream continuously without running into a "blocking" 
problem. So can HBase.

From an HDFS client perspective, you can configure the number of in-flight 
packets during the write. If you set it to 1, only one packet can be in 
flight at a time and the connection appears to be "blocking". In addition, if 
you have HDFS transport encryption enabled but no hardware acceleration, the 
connection will also appear to be "blocking" or slow because of the 
encrypt/decrypt overhead.
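
To make the in-flight packet point concrete, here is a toy model in plain 
Java. It is not HDFS code: the class name, the method ticksToWrite, and the 
tick and ack latency numbers are all made up for illustration. It only shows 
why a window of 1 makes a writer wait for every ack, while a larger window 
lets it keep sending.

```java
import java.util.ArrayDeque;

/** Toy model of a bounded in-flight packet window. Hypothetical, not HDFS code. */
final class InFlightWindowSketch {

    /**
     * Returns the number of ticks until all packets are sent and acked,
     * assuming one tick to send a packet and a fixed ack latency.
     */
    static long ticksToWrite(int packets, int maxInFlight, long ackLatency) {
        if (maxInFlight < 1) {
            throw new IllegalArgumentException("maxInFlight must be >= 1");
        }
        // Ack arrival time of each packet that is still in flight, oldest first.
        ArrayDeque<Long> ackTimes = new ArrayDeque<>();
        long now = 0;
        long lastAck = 0;
        for (int i = 0; i < packets; i++) {
            // Retire packets whose acks have already arrived.
            while (!ackTimes.isEmpty() && ackTimes.peekFirst() <= now) {
                ackTimes.pollFirst();
            }
            // Window full: the writer stalls until the oldest packet is acked.
            if (ackTimes.size() >= maxInFlight) {
                now = ackTimes.pollFirst();
            }
            lastAck = now + ackLatency;
            ackTimes.addLast(lastAck);
            now += 1; // one tick to hand the packet to the pipeline
        }
        return Math.max(now, lastAck);
    }

    public static void main(String[] args) {
        System.out.println("window=1:  " + ticksToWrite(10, 1, 5) + " ticks");
        System.out.println("window=80: " + ticksToWrite(10, 80, 5) + " ticks");
    }
}
```

With these made-up numbers (10 packets, ack latency of 5 ticks), a window of 
1 takes 50 ticks and a window of 80 takes 14, because the writer never has to 
stop and wait for an ack in the second case.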

 

There is also a configuration that lets the NameNode close a file once its 
blocks have reached a minimum number of replicas on the DataNodes. But I 
don't think that's what you are asking about here.

> Proposal to support writing replications to HDFS asynchronously
> ---------------------------------------------------------------
>
>                 Key: HDFS-13117
>                 URL: https://issues.apache.org/jira/browse/HDFS-13117
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: xuchuanyin
>            Priority: Major
>
> My initial question was as below:
> ```
> I've learned that when we write data to HDFS using the interfaces provided 
> by HDFS, such as 'FileSystem.create', our client blocks until all the 
> blocks and their replicas are written. This causes an efficiency problem if 
> we use HDFS as our final data storage. Many of my colleagues write the data 
> to local disk in the main thread and copy it to HDFS in another thread. 
> Obviously, this increases disk I/O.
>  
>    So, is there a way to optimize this usage? I don't want to increase 
> disk I/O, nor do I want to be blocked while the extra replicas are being 
> written.
>   How about writing to HDFS with a replication of one in the main thread 
> and setting the actual replication in another thread? Or is there a better 
> way to do this?
> ```
>  
> So my proposal here is to support writing the extra replicas to HDFS 
> asynchronously. The user can set a minimum replication as the acceptable 
> number of replicas ( < the default or expected replication). When writing 
> to HDFS, the user is blocked only until the minimum replication has been 
> reached, and HDFS completes the extra replicas in the background. Since 
> HDFS periodically checks the integrity of all replicas, we can also leave 
> this work to HDFS itself.
>  
> There are two ways to provide the interface:
> 1. Create a series of overloads by adding an `acceptableReplication` 
> parameter to the current methods, as below:
> ```
> Before:
> FSDataOutputStream create(Path f,
>   boolean overwrite,
>   int bufferSize,
>   short replication,
>   long blockSize
> ) throws IOException
>  
> After:
> FSDataOutputStream create(Path f,
>   boolean overwrite,
>   int bufferSize,
>   short replication,
>   // minimum number of replicas to finish before returning
>   short acceptableReplication,
>   long blockSize
> ) throws IOException
> ```
>  
> 2. Add `acceptableReplication` and `asynchronous` to the runtime (or 
> default) configuration, so users do not have to change any interface and 
> still benefit from this feature.
>  
> What do you think about this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
