[ https://issues.apache.org/jira/browse/HDFS-9668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195782#comment-15195782 ]

Colin Patrick McCabe edited comment on HDFS-9668 at 3/15/16 5:48 PM:
---------------------------------------------------------------------

Hi [[email protected]],

Thanks again for your comments.  I agree that consistency is always a headache. 
 However, we are already "inconsistent" in a bunch of cases.  For example, 
{{FsDatasetSpi#getStoredBlock}} returns a Block structure with a genstamp and 
block ID.  But since it drops the lock when it returns, the genstamp may change 
in between the call to {{getStoredBlock}} and the actual usage of that 
information.
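
To make that race concrete, here is a rough standalone sketch of the pattern (simplified stand-ins, not the real {{FsDatasetSpi}} types):

{code:java}
// Simplified illustration of the existing read-then-use race: a value read
// under the dataset lock can already be stale by the time the caller uses it.
public class StaleGenstampSketch {
    private final Object datasetLock = new Object();
    private long genstamp = 1000;

    // Analogous to getStoredBlock(): reads under the lock, then releases it.
    public long getGenstamp() {
        synchronized (datasetLock) {
            return genstamp;
        }
    }

    // Analogous to a recovery bumping the genstamp.
    public void bumpGenstamp() {
        synchronized (datasetLock) {
            genstamp++;
        }
    }

    public void caller() {
        long observed = getGenstamp();
        // Another thread may call bumpGenstamp() here, so 'observed' can be
        // stale by the time it is acted on below.
        System.out.println("Acting on genstamp " + observed);
    }
}
{code}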

bq. But still it is hard to remove the locks in createRbw, etc where the 
long-time blocking occur. I think this is what we have to tackle in the future.

For {{createRbw}}, it seems like we could:
1. add the entry to the volumeMap
2. drop the lock and attempt to create the block file on-disk
3. if the creation failed, take back the lock and remove the entry from the 
volumeMap

Step #1 would ensure that if another thread attempted to create the same RBW 
replica, it would fail.
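
Roughly, a sketch of that flow (with simplified stand-in types, not the real {{FsDatasetImpl}} code) might look like:

{code:java}
import java.io.File;
import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the proposed createRbw flow: reserve the replica under the lock,
// do the slow on-disk work outside the lock, roll back on failure.
public class CreateRbwSketch {
    // Stand-in for the volumeMap: block ID -> replica state.
    private final ConcurrentHashMap<Long, String> volumeMap = new ConcurrentHashMap<>();
    private final Object datasetLock = new Object();

    public File createRbw(long blockId, File volumeDir) throws IOException {
        // Step 1: add the entry to the volumeMap while holding the lock.
        // A second thread trying to create the same RBW replica fails here.
        synchronized (datasetLock) {
            if (volumeMap.putIfAbsent(blockId, "RBW") != null) {
                throw new IOException("Replica " + blockId + " already exists");
            }
        }
        // Step 2: drop the lock and attempt to create the block file on disk.
        File rbwFile = new File(volumeDir, "blk_" + blockId + ".rbw");
        boolean created = false;
        try {
            created = rbwFile.createNewFile();
            if (!created) {
                throw new IOException("Could not create " + rbwFile);
            }
            return rbwFile;
        } finally {
            // Step 3: if the creation failed, take back the lock and remove
            // the entry from the volumeMap.
            if (!created) {
                synchronized (datasetLock) {
                    volumeMap.remove(blockId);
                }
            }
        }
    }
}
{code}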

bq. But "synchronized" doesn't guarantee fairness, is it fair to ask lock to 
support fairness?

Currently, read operations have no special advantage over write operations.  
Using a reader/writer lock changes that.  It's easy to come up with a workload 
where read requests arrive often enough that there is no time at all for write 
requests.  This is especially true since we are doing filesystem I/O while 
holding the reader lock.  We have observed Java reader/writer locks starving 
writers in practice.  That's why there is an option for the FSNamesystem lock 
to be fair.
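
For reference, Java's {{ReentrantReadWriteLock}} already exposes a fairness flag; a minimal sketch of how a fair reader/writer lock would be set up (illustrative names, not actual {{FsDatasetImpl}} code):

{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Passing true requests fair ordering, so a queued writer is not starved by a
// continuous stream of readers (the same idea as the FSNamesystem fairness option).
public class FairLockSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

    public void readOperation() {
        lock.readLock().lock();
        try {
            // read-only access to shared state (ideally no filesystem I/O here)
        } finally {
            lock.readLock().unlock();
        }
    }

    public void writeOperation() {
        lock.writeLock().lock();
        try {
            // mutate shared state
        } finally {
            lock.writeLock().unlock();
        }
    }
}
{code}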

Hmm.  I wonder if, as a first step, we could try moving all the filesystem I/O 
that we can outside the lock?  That would provide a huge performance boost just 
by itself.  And it would make it much easier to have a reader/writer lock later 
if required.



> Optimize the locking in FsDatasetImpl
> -------------------------------------
>
>                 Key: HDFS-9668
>                 URL: https://issues.apache.org/jira/browse/HDFS-9668
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>            Reporter: Jingcheng Du
>            Assignee: Jingcheng Du
>         Attachments: HDFS-9668-1.patch, HDFS-9668-2.patch, execution_time.png
>
>
> During an HBase test on tiered HDFS storage (the WAL is stored on 
> SSD/RAMDISK, and all other files are stored on HDD), we observe many 
> long-lasting BLOCKED threads on FsDatasetImpl in the DataNode. The following 
> is part of the jstack output:
> {noformat}
> "DataXceiver for client DFSClient_NONMAPREDUCE_-1626037897_1 at 
> /192.168.50.16:48521 [Receiving block 
> BP-1042877462-192.168.50.13-1446173170517:blk_1073779272_40852]" - Thread 
> t@93336
>    java.lang.Thread.State: BLOCKED
>       at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:1111)
>       - waiting to lock <18324c9> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) owned by 
> "DataXceiver for client DFSClient_NONMAPREDUCE_-1626037897_1 at 
> /192.168.50.16:48520 [Receiving block 
> BP-1042877462-192.168.50.13-1446173170517:blk_1073779271_40851]" t@93335
>       at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:113)
>       at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:183)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:615)
>       at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
>       at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
>       at java.lang.Thread.run(Thread.java:745)
>    Locked ownable synchronizers:
>       - None
>       
> "DataXceiver for client DFSClient_NONMAPREDUCE_-1626037897_1 at 
> /192.168.50.16:48520 [Receiving block 
> BP-1042877462-192.168.50.13-1446173170517:blk_1073779271_40851]" - Thread 
> t@93335
>    java.lang.Thread.State: RUNNABLE
>       at java.io.UnixFileSystem.createFileExclusively(Native Method)
>       at java.io.File.createNewFile(File.java:1012)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:66)
>       at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:271)
>       at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:286)
>       at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:1140)
>       - locked <18324c9> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
>       at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:113)
>       at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:183)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:615)
>       at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
>       at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
>       at java.lang.Thread.run(Thread.java:745)
>    Locked ownable synchronizers:
>       - None
> {noformat}
> We measured the execution time of some operations in FsDatasetImpl during 
> the test. The results are shown below.
> !execution_time.png!
> Under heavy load, the finalizeBlock, addBlock and createRbw operations on 
> HDD take a very long time.
> This means that a single slow finalizeBlock, addBlock or createRbw call on a 
> slow storage can block all other such operations in the same DataNode, 
> especially for HBase when many WAL/flusher/compactor threads are configured.
> We need a finer-grained locking mechanism in a new FsDatasetImpl 
> implementation, and users can choose the implementation by configuring 
> "dfs.datanode.fsdataset.factory" in the DataNode.
> The lock can be implemented at either the storage level or the block level.


