[ https://issues.apache.org/jira/browse/HDFS-13117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354893#comment-16354893 ]
Wei-Chiu Chuang commented on HDFS-13117:
----------------------------------------

Hi [~xuchuanyin], thanks for filing the jira. May I ask which client application you are describing here? Flume, for example, can append to an HDFS stream continuously and there is no "blocking" problem; the same goes for HBase.

From an HDFS client perspective, you can configure the number of in-flight packets during a write. If you configure it to allow only one in-flight packet, the connection appears to be "blocking". In addition, if you have HDFS transport encryption enabled but do not have hardware acceleration enabled, the connection will also appear "blocking", or slow, because of the encrypt/decrypt overhead.

There is also a configuration that lets the NameNode close a file once it reaches a minimal number of replicas on DataNodes, but I don't think that is what you are asking here.

> Proposal to support writing replications to HDFS asynchronously
> ---------------------------------------------------------------
>
>           Key: HDFS-13117
>           URL: https://issues.apache.org/jira/browse/HDFS-13117
>       Project: Hadoop HDFS
>    Issue Type: New Feature
>      Reporter: xuchuanyin
>      Priority: Major
>
> My initial question was as below:
> ```
> I've learned that when we write data to HDFS using an interface provided by
> HDFS, such as 'FileSystem.create', the client will block until all the blocks
> and their replications are done. This causes an efficiency problem if we use
> HDFS as our final data storage. Many of my colleagues write the data to
> local disk in the main thread and copy it to HDFS in another thread;
> obviously, this increases the disk I/O.
>
> So, is there a way to optimize this usage? I don't want to increase the
> disk I/O, nor do I want to be blocked while the extra replications are
> written.
> How about writing to HDFS with only one replication in the main thread and
> setting the actual number of replications in another thread?
> Or is there any better way to do this?
> ```
>
> So my proposal here is to support writing extra replications to HDFS
> asynchronously. The user can set a minimum replication as the acceptable
> number of replications (< the default or expected replication). When
> writing to HDFS, the user will be blocked only until the minimum
> replication has been finished, and HDFS will complete the extra
> replications in the background. Since HDFS periodically checks the
> integrity of all the replications anyway, we can also leave this work to
> HDFS itself.
>
> There are two ways to provide the interfaces:
> 1. Create a series of interfaces by adding an `acceptableReplication`
> parameter to the current interfaces, as below:
> ```
> Before:
> FSDataOutputStream create(Path f,
>     boolean overwrite,
>     int bufferSize,
>     short replication,
>     long blockSize
> ) throws IOException
>
> After:
> FSDataOutputStream create(Path f,
>     boolean overwrite,
>     int bufferSize,
>     short replication,
>     short acceptableReplication, // minimum number of replications to finish before return
>     long blockSize
> ) throws IOException
> ```
>
> 2. Add `acceptableReplication` and `asynchronous` to the runtime (or
> default) configuration, so users will not have to change any interface and
> will still benefit from this feature.
>
> What do you think about this?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
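For reference, the client-side knobs mentioned in the comment can be set in hdfs-site.xml. This is only a sketch: the key names below are taken from recent Hadoop releases and may differ in your version, so verify them against your distribution's hdfs-default.xml before relying on them.

```xml
<!-- Sketch of the settings referred to in the comment; key names may
     vary across Hadoop versions. -->

<!-- Number of unacknowledged packets the client may keep in flight while
     writing; setting it to 1 makes the write appear "blocking". -->
<property>
  <name>dfs.client.write.max-packets-in-flight</name>
  <value>80</value>
</property>

<!-- Minimal number of replicas a block needs before the NameNode
     considers it complete and can close the file. -->
<property>
  <name>dfs.namenode.replication.min</name>
  <value>1</value>
</property>

<!-- With transport encryption enabled, choosing an AES cipher suite lets
     clients use AES-NI hardware acceleration where available. -->
<property>
  <name>dfs.encrypt.data.transfer.cipher.suites</name>
  <value>AES/CTR/NoPadding</value>
</property>
```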
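The blocking semantics the proposal asks for can be illustrated without any Hadoop dependency. The following is a toy model, not the real DFSOutputStream pipeline, and every name in it is illustrative: the caller blocks only until `minAcks` of `replicas` simulated pipeline members have acknowledged, while the remaining replicas finish in the background.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Toy model of the proposed semantics (NOT the real HDFS write pipeline):
 * the writer unblocks after `minAcks` of `replicas` acknowledgements,
 * while the slower replicas complete in the background.
 */
public class MinAckWrite {

    /** Returns the number of replica acks observed at the moment the caller unblocks. */
    public static int writeBlocking(int replicas, int minAcks, long[] ackDelaysMs) {
        CountDownLatch minLatch = new CountDownLatch(minAcks);
        AtomicInteger acked = new AtomicInteger();
        ExecutorService pipeline = Executors.newFixedThreadPool(replicas);
        for (int i = 0; i < replicas; i++) {
            long delay = ackDelaysMs[i];
            pipeline.submit(() -> {
                try {
                    Thread.sleep(delay);      // simulated transfer time for this replica
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                acked.incrementAndGet();
                minLatch.countDown();         // this replica acknowledged
            });
        }
        try {
            minLatch.await();                 // caller blocks only for minAcks acks
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        int acksAtUnblock = acked.get();
        pipeline.shutdown();                  // "background replication" keeps running
        try {
            pipeline.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return acksAtUnblock;
    }

    public static void main(String[] args) {
        // One fast replica, two slow ones: the caller returns after the first ack,
        // typically before the slow replicas have acknowledged.
        int acks = writeBlocking(3, 1, new long[]{0, 400, 400});
        System.out.println("acks at unblock: " + acks);
    }
}
```

The same effect is achievable today, as the reporter notes, by creating the file with replication 1 and raising the replication factor afterwards (e.g. via FileSystem.setReplication), letting the NameNode schedule the extra copies asynchronously.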