[ https://issues.apache.org/jira/browse/HDFS-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055606#comment-13055606 ]

Eric Payne commented on HDFS-1257:
----------------------------------

What is the status of this Jira?

I believe that I am also running into this issue. I am using the yahoo_merge 
branch, but the affected code appears to be the same in all branches.

When running stress tests, the NameNode daemon hits a 
ConcurrentModificationException under certain race conditions and exits.

This seems to be a fairly critical bug that could cause the NameNode to exit 
under stress conditions.

The node configuration I am using runs a single independent namenode on one 
machine and hundreds of simulated (via MiniDFSCluster) datanodes on each of 9 
other machines, for a total of up to 2000 simulated datanodes.

Then, in this environment, the DataNodeGenerator test is run, performing random 
reads, creates, writes, and deletes. The goal is to stress the NameNode with 
hundreds of operations per second.

Under certain timings, while the ReplicationMonitor thread is computing 
invalidation work, the recentInvalidateSets TreeMap inside BlockManager is 
modified by another thread at the same time the ReplicationMonitor is iterating 
over it.

Here is the exception and stack traceback:

2011-06-08 15:33:41,551 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: ReplicationMonitor thread received Runtime exception.
java.util.ConcurrentModificationException
        at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1100)
        at java.util.TreeMap$KeyIterator.next(TreeMap.java:1154)
        at java.util.AbstractCollection.toArray(AbstractCollection.java:124)
        at java.util.ArrayList.<init>(ArrayList.java:131)
        at org.apache.hadoop.hdfs.server.namenode.BlockManager.computeInvalidateWork(BlockManager.java:682)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.computeDatanodeWork(FSNamesystem.java:2978)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem$ReplicationMonitor.run(FSNamesystem.java:2925)
        at java.lang.Thread.run(Thread.java:619)
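
For what it's worth, this is just TreeMap's fail-fast iterator doing its job: a 
structural modification from another thread while computeInvalidateWork() is 
copying the key set produces exactly this exception. A standalone illustration 
follows (plain Java, not HDFS code; names like TreeMapRaceDemo are made up for 
the example, and since it is a race it usually, but not always, dies with the 
same ConcurrentModificationException):

import java.util.ArrayList;
import java.util.Map;
import java.util.TreeMap;

public class TreeMapRaceDemo {
  public static void main(String[] args) throws Exception {
    final Map<String, String> map = new TreeMap<String, String>();
    for (int i = 0; i < 1000; i++) {
      map.put("node-" + i, "blocks");
    }

    // Writer thread: keeps adding entries, standing in for the thread that
    // queues blocks into recentInvalidateSets (e.g. on replication decrease).
    Thread writer = new Thread(new Runnable() {
      public void run() {
        for (int i = 1000; i < 2000000; i++) {
          map.put("node-" + i, "blocks");
        }
      }
    });
    writer.start();

    // Reader: repeatedly copies the key set, as computeInvalidateWork() does.
    // TreeMap's fail-fast iterator usually detects the concurrent put and
    // throws ConcurrentModificationException from inside toArray().
    for (int i = 0; i < 10000; i++) {
      new ArrayList<String>(map.keySet());
    }
    writer.join();
  }
}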



One thing I did try was to go into BlockManager and put synchronized blocks 
around every place that iterates over, adds to, or removes from the 
recentInvalidateSets TreeMap.

I'm not sure what performance (or other unforeseen) ramifications this may have.

However, I was able to eliminate the ConcurrentModificationException by using 
this fix, at least in my test environment.
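
Roughly, the change looks like the sketch below. This is a simplified 
illustration of the workaround I tried, not the attached patch; the field and 
value types are trimmed down (the real map holds Block collections, and the 
class name InvalidateSetsSketch is invented for the example):

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeMap;

class InvalidateSetsSketch {
  // Maps datanode storage ID -> block IDs queued for invalidation.
  private final TreeMap<String, Collection<Long>> recentInvalidateSets =
      new TreeMap<String, Collection<Long>>();

  // Called from the ReplicationMonitor thread.
  List<String> snapshotNodesWithInvalidBlocks() {
    synchronized (recentInvalidateSets) {
      // Copying the key set under the lock prevents the iteration in
      // computeInvalidateWork() from overlapping with a concurrent add/remove.
      return new ArrayList<String>(recentInvalidateSets.keySet());
    }
  }

  // Called from other threads (e.g. when replication is reduced on a file).
  void addToInvalidates(String storageID, long blockId) {
    synchronized (recentInvalidateSets) {
      Collection<Long> blocks = recentInvalidateSets.get(storageID);
      if (blocks == null) {
        blocks = new ArrayList<Long>();
        recentInvalidateSets.put(storageID, blocks);
      }
      blocks.add(blockId);
    }
  }
}

Synchronizing on the map itself just serializes the snapshot against concurrent 
adds and removes; given that the issue description below points at missing 
read-lock protection in computeInvalidateWork, taking the FSNamesystem lock 
there may be the more consistent fix.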



> Race condition introduced by HADOOP-5124
> ----------------------------------------
>
>                 Key: HDFS-1257
>                 URL: https://issues.apache.org/jira/browse/HDFS-1257
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>            Reporter: Ramkumar Vadali
>         Attachments: HDFS-1257.patch
>
>
> HADOOP-5124 provided some improvements to FSNamesystem#recentInvalidateSets. 
> But it introduced unprotected access to the data structure 
> recentInvalidateSets. Specifically, FSNamesystem.computeInvalidateWork 
> accesses recentInvalidateSets without read-lock protection. If there is 
> concurrent activity (like reducing replication on a file) that adds to 
> recentInvalidateSets, the name-node crashes with a 
> ConcurrentModificationException.

