[ https://issues.apache.org/jira/browse/HDFS-6428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003497#comment-14003497 ]

Yongjun Zhang commented on HDFS-6428:
-------------------------------------

Thanks Colin for the offline discussion. He suggested finding out what caused 
the runtime increase, and I figured out that it is the highlighted line in the 
block below that caused the long runtime after I added the synchronized 
statement, which is understandable because multiple threads synchronize there. 
Without that synchronization, though, it would not be safe.

{code}
  void addBlockPool(final String bpid, final Configuration conf) throws IOException {
    ...
    List<Thread> blockPoolAddingThreads = new ArrayList<Thread>();
    for (final FsVolumeImpl v : volumes) {
      Thread t = new Thread() {
        public void run() {
          try {
            ...
            v.addBlockPool(bpid, conf); <===================== if synchronized, caused slow performance
            ...
{code}
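
For illustration, here is a minimal stand-alone sketch (hypothetical class names and timings, not the actual FsDatasetImpl/FsVolumeImpl code) of why wrapping the per-volume work in a synchronized block on a shared lock makes the per-volume threads run one after another instead of in parallel:

{code}
import java.util.ArrayList;
import java.util.List;

public class SerializedAddBlockPool {
  // Shared lock standing in for the object the synchronized statement locks on.
  private static final Object SHARED_LOCK = new Object();

  // Stand-in for the per-volume work done in FsVolumeImpl#addBlockPool
  // (e.g. scanning the block pool directory on that volume).
  static void addBlockPoolOnVolume(String bpid) throws InterruptedException {
    Thread.sleep(1000);
  }

  public static void main(String[] args) throws Exception {
    List<Thread> threads = new ArrayList<Thread>();
    long start = System.currentTimeMillis();
    for (int i = 0; i < 4; i++) {            // pretend we have 4 volumes
      Thread t = new Thread() {
        public void run() {
          // With this synchronized block the 4 threads run one after
          // another (~4s total); without it they overlap (~1s total).
          synchronized (SHARED_LOCK) {
            try {
              addBlockPoolOnVolume("BP-1");
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
            }
          }
        }
      };
      threads.add(t);
      t.start();
    }
    for (Thread t : threads) {
      t.join();
    }
    System.out.println("elapsed ms: " + (System.currentTimeMillis() - start));
  }
}
{code}

That serialization of the per-volume threads is where the runtime increase came from.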

Why did we not easily run into a problem at the highlighted line above? That 
question made me realize that bpSlices is a {{ConcurrentHashMap}}, which is 
designed to take care of most of the concurrency issues. From its javadoc:
{code}
The allowed concurrency among update operations is guided by the optional 
concurrencyLevel constructor argument (default 16), which is used as a hint for 
internal sizing. The table is internally partitioned to try to permit the 
indicated number of concurrent updates without contention. Because placement in 
hash tables is essentially random, the actual concurrency will vary. Ideally, 
you should choose a value to accommodate as many threads as will ever 
concurrently modify the table. 
{code}
So I think adding another level of synchronization for addBlockPool and some of 
the other operations is not necessary (though some of them may really need it). 
The real fix should be based on the ConcurrentHashMap requirements.
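
As a minimal single-threaded sketch (a hypothetical demo class, not HDFS code) of the difference the map type makes when a map is modified while it is being iterated, which is the shape of interleaving behind the ConcurrentModificationException in the stack trace below:

{code}
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CmeDemo {
  // Iterate over the map and add a new entry mid-iteration.
  static void iterateWhileAdding(Map<String, String> map) {
    map.put("bp-1", "a");
    map.put("bp-2", "b");
    try {
      for (String key : map.keySet()) {
        map.put("bp-3", "c");  // structural modification during iteration
      }
      System.out.println(map.getClass().getSimpleName() + ": iteration completed");
    } catch (ConcurrentModificationException e) {
      System.out.println(map.getClass().getSimpleName() + ": " + e);
    }
  }

  public static void main(String[] args) {
    // HashMap's fail-fast iterator throws ConcurrentModificationException;
    // ConcurrentHashMap's weakly consistent iterator does not.
    iterateWhileAdding(new HashMap<String, String>());
    iterateWhileAdding(new ConcurrentHashMap<String, String>());
  }
}
{code}

Note the weakly consistent iterator only avoids the exception; whether an iteration sees an entry that is being added concurrently is a separate question, which is part of the ConcurrentHashMap semantics a fix would need to respect.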

The other day when I worked out the patch, the issue was very reproducible in 
my environment, but unfortunately it no longer is (because I cleaned my build, 
and because of the nature of this issue), so I can't verify whether a new fix 
resolves the problem. I will keep watching to see whether I hit this issue again.

BTW [~daryn], thanks for your comment "Do we know what else is modifying 
bpSlices and causing the CME? Hopefully we aren't masking another bug." The 
discussion we have had so far is along this line.

Thanks.

> TestWebHdfsWithMultipleNameNodes failed with ConcurrentModificationException
> ----------------------------------------------------------------------------
>
>                 Key: HDFS-6428
>                 URL: https://issues.apache.org/jira/browse/HDFS-6428
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: webhdfs
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-6428.001.patch
>
>
> TestWebHdfsWithMultipleNameNodes failed as follows:
> {code}
> Running org.apache.hadoop.hdfs.web.TestWebHdfsWithMultipleNameNodes
> Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 8.643 sec <<< FAILURE! - in org.apache.hadoop.hdfs.web.TestWebHdfsWithMultipleNameNodes
> org.apache.hadoop.hdfs.web.TestWebHdfsWithMultipleNameNodes  Time elapsed: 3.771 sec  <<< ERROR!
> java.util.ConcurrentModificationException: null
>         at java.util.HashMap$HashIterator.nextEntry(HashMap.java:894)
>         at java.util.HashMap$EntryIterator.next(HashMap.java:934)
>         at java.util.HashMap$EntryIterator.next(HashMap.java:932)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.shutdown(FsVolumeImpl.java:251)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.shutdown(FsVolumeList.java:249)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.shutdown(FsDatasetImpl.java:1389)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.shutdown(DataNode.java:1304)
>         at org.apache.hadoop.hdfs.MiniDFSCluster.shutdownDataNodes(MiniDFSCluster.java:1555)
>         at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:1530)
>         at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:1514)
>         at org.apache.hadoop.hdfs.web.TestWebHdfsWithMultipleNameNodes.shutdownCluster(TestWebHdfsWithMultipleNameNodes.java:99)
> {code}


