xuchuanyin created HDFS-13117:
---------------------------------
Summary: Proposal to support writing replications to HDFS asynchronously
Key: HDFS-13117
URL: https://issues.apache.org/jira/browse/HDFS-13117
Project: Hadoop HDFS
Issue Type: New Feature
Reporter: xuchuanyin
My initial question was as below:
```
I've learned that when we write data to HDFS using an interface provided by
HDFS such as 'FileSystem.create', the client blocks until all the blocks and
their replicas have been written. This causes an efficiency problem if we use
HDFS as our final data storage. Many of my colleagues work around it by writing
the data to local disk in the main thread and copying it to HDFS in another
thread, which obviously increases disk I/O.
So, is there a way to optimize this usage? I don't want to increase disk I/O,
nor do I want to be blocked while the extra replicas are written.
How about writing to HDFS with a replication factor of one in the main thread
and setting the actual replication factor in another thread? Or is there a
better way to do this?
```
So my proposal here is to support writing the extra replicas to HDFS
asynchronously. The user can set a minimum replication factor as the acceptable
number of replicas (less than the default or expected replication factor). When
writing to HDFS, the user is blocked only until the minimum number of replicas
has been written, and HDFS completes the remaining replicas in the background.
Since HDFS already checks the integrity of all replicas periodically, we can
also leave this work to HDFS itself.
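To make the intended semantics concrete, here is a minimal sketch in plain Java. It is a toy model, not HDFS code: the class `AsyncReplicationSketch`, the method `write`, and the `WriteResult` type are all hypothetical, and "writing a replica" is simulated by copying a byte array into a list. It only illustrates that the caller is unblocked after `acceptableReplication` copies exist while the remaining copies are produced on a background thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Toy model of the proposal; no real HDFS classes are involved. */
public class AsyncReplicationSketch {

    /** What the caller sees: replicas finished at return time, plus a handle for the rest. */
    public static final class WriteResult {
        public final int replicasAtReturn;
        public final CompletableFuture<Integer> allReplicas;

        WriteResult(int replicasAtReturn, CompletableFuture<Integer> allReplicas) {
            this.replicasAtReturn = replicasAtReturn;
            this.allReplicas = allReplicas;
        }
    }

    /**
     * Hypothetical write: blocks for the first acceptableReplication copies,
     * then schedules the remaining copies on the background executor.
     */
    public static WriteResult write(byte[] data, int replication,
                                    int acceptableReplication,
                                    ExecutorService background) {
        if (acceptableReplication < 1 || acceptableReplication > replication) {
            throw new IllegalArgumentException(
                "acceptableReplication must be in [1, replication]");
        }
        List<byte[]> replicas = Collections.synchronizedList(new ArrayList<>());

        // Synchronous part: the caller waits for these copies.
        for (int i = 0; i < acceptableReplication; i++) {
            replicas.add(data.clone());
        }
        int done = replicas.size();

        // Asynchronous part: the extra copies are made after the caller returns.
        CompletableFuture<Integer> rest = CompletableFuture.supplyAsync(() -> {
            for (int i = done; i < replication; i++) {
                replicas.add(data.clone());
            }
            return replicas.size();
        }, background);

        return new WriteResult(done, rest);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService background = Executors.newSingleThreadExecutor();
        try {
            byte[] block = "block".getBytes(StandardCharsets.UTF_8);
            WriteResult r = write(block, 3, 1, background);
            System.out.println("replicas when write returned: " + r.replicasAtReturn);
            System.out.println("replicas after background work: "
                + r.allReplicas.get(5, TimeUnit.SECONDS));
        } finally {
            background.shutdown();
        }
    }
}
```

With a replication of 3 and an acceptable replication of 1, the caller sees one replica when `write` returns and three once the background future completes. In a real implementation the background step would be carried out by HDFS itself rather than by a client-side executor.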
There are two ways to provide the interface:
1. Create a new set of overloads by adding an `acceptableReplication` parameter
to the existing interfaces, as below:
```
// Before:
FSDataOutputStream create(Path f,
    boolean overwrite,
    int bufferSize,
    short replication,
    long blockSize
) throws IOException

// After:
FSDataOutputStream create(Path f,
    boolean overwrite,
    int bufferSize,
    short replication,
    short acceptableReplication, // minimum number of replicas to finish before returning
    long blockSize
) throws IOException
```
2. Add `acceptableReplication` and `asynchronous` settings to the runtime (or
default) configuration, so that users benefit from this feature without
changing any interface calls.
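As a rough illustration of the configuration-driven option, the sketch below shows how a client might derive the number of replicas to wait for from two settings. Everything here is an assumption: the property keys `asynchronous` and `acceptableReplication` simply reuse the names from this proposal and do not exist in Hadoop, and `java.util.Properties` stands in for the real configuration object.

```java
import java.util.Properties;

/** Hypothetical resolution of the proposed settings; not an existing Hadoop API. */
public class AcceptableReplicationConfig {

    /**
     * Returns how many replicas a write should wait for.
     * If "asynchronous" is unset or false, the write waits for all replicas,
     * which matches the current blocking behaviour.
     */
    public static short effectiveAcceptableReplication(Properties conf, short replication) {
        boolean async = Boolean.parseBoolean(conf.getProperty("asynchronous", "false"));
        if (!async) {
            return replication;
        }
        short acceptable = Short.parseShort(
            conf.getProperty("acceptableReplication", String.valueOf(replication)));
        // Clamp to [1, replication] so a bad setting cannot exceed the target
        // or drop below one replica.
        return (short) Math.max(1, Math.min(acceptable, replication));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("asynchronous", "true");
        conf.setProperty("acceptableReplication", "1");
        System.out.println(effectiveAcceptableReplication(conf, (short) 3));
        System.out.println(effectiveAcceptableReplication(new Properties(), (short) 3));
    }
}
```

The design choice in this sketch is that an absent or disabled `asynchronous` flag falls back to waiting for every replica, so existing callers would see no change unless they opt in.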
What do you think about this?