Daryn Sharp created HDFS-12639:
----------------------------------

             Summary: BPOfferService lock may stall all service actors
                 Key: HDFS-12639
                 URL: https://issues.apache.org/jira/browse/HDFS-12639
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
    Affects Versions: 2.8.0
            Reporter: Daryn Sharp


{{BPOfferService}} manages the {{BPServiceActor}} instances for the active and 
standby NameNodes.  It uses a read-write lock primarily to protect registration 
information while determining the active/standby from heartbeats.

Unfortunately the write lock is held during command processing.  If an actor is 
experiencing high latency processing commands, the other actor will neither be 
able to register (blocked in createRegistration, setNamespaceInfo, 
verifyAndSetNamespaceInfo) nor process heartbeats (blocked in 
updateActorStatesFromHeartbeat).
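
To make the stall concrete, here is a minimal sketch of the contention pattern. 
It is not the Hadoop code: the class and method names are hypothetical, and a 
plain ReentrantReadWriteLock stands in for the lock in {{BPOfferService}}.  One 
thread holds the write lock as if it were processing a command, while a second 
thread plays the other actor and tries to take the same lock.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical demo; none of these names come from the Hadoop code base. */
class LockStallDemo {

    /**
     * Returns true if a second thread could take the write lock while the
     * calling thread still holds it for "command processing".
     */
    static boolean otherActorCanLock() {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        ExecutorService otherActor = Executors.newSingleThreadExecutor();

        lock.writeLock().lock();  // first actor starts processing a command
        try {
            // The other actor attempts to register or handle a heartbeat.
            // tryLock() returns immediately instead of blocking.
            return CompletableFuture.supplyAsync(() -> {
                boolean acquired = lock.writeLock().tryLock();
                if (acquired) {
                    lock.writeLock().unlock();
                }
                return acquired;
            }, otherActor).join();
        } finally {
            lock.writeLock().unlock();  // command processing ends
            otherActor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("other actor acquired lock: " + otherActorCanLock());
    }
}
```

In this sketch the second thread fails to get the lock for as long as the first 
one holds it.  With a blocking lock() call, as assumed for the real actor, it 
would instead wait until the command finishes.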

The worst-case scenario for processing commands while holding the lock is 
re-registration.  The actor will loop, catching and logging exceptions, leaving 
the other actor blocked for a non-deterministic (possibly infinite) amount of 
time.

The lock must not be held during command processing.
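
One possible shape for this, offered only as a sketch and not as the actual 
patch: take the lock just long enough to read the state needed for the decision, 
release it, and then run the command.  The class, field, and method names below 
are hypothetical and do not mirror the real {{BPOfferService}} API.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for BPOfferService; not the actual Hadoop class. */
class CommandScopeSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private String activeActor;  // illustrative state protected by the lock

    /** Short critical section: only the state update happens under the lock. */
    void setActiveActor(String actor) {
        lock.writeLock().lock();
        try {
            activeActor = actor;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /**
     * Checks under the read lock whether the command came from the active
     * actor, then runs the (possibly slow) command with no lock held.
     */
    boolean processCommand(String actor, Runnable command) {
        boolean fromActive;
        lock.readLock().lock();
        try {
            fromActive = actor.equals(activeActor);
        } finally {
            lock.readLock().unlock();
        }
        if (!fromActive) {
            return false;  // ignore commands from the non-active actor
        }
        command.run();  // lock already released at this point
        return true;
    }

    /** True if the calling thread currently holds the read or write lock. */
    boolean holdsAnyLock() {
        return lock.isWriteLockedByCurrentThread() || lock.getReadHoldCount() > 0;
    }

    public static void main(String[] args) {
        CommandScopeSketch s = new CommandScopeSketch();
        s.setActiveActor("nn1");
        s.processCommand("nn1",
            () -> System.out.println("lock held in command: " + s.holdsAnyLock()));
    }
}
```

The trade-off in this sketch is that the active/standby state may change between 
the check and the command, so a real fix would need to decide whether that 
window is acceptable.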



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

