[ 
https://issues.apache.org/jira/browse/HDFS-12639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202766#comment-16202766
 ] 

Hanisha Koneru commented on HDFS-12639:
---------------------------------------

Hi [~daryn],

This is my understanding of the problem. Please correct me if I am wrong.

BPServiceActor obtains the writeLock before processing each command and 
releases it after. During the processing of a single command, the other actor 
would not be able to register or process heartbeats.
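
To make sure we are describing the same behaviour, here is a minimal,
standalone sketch of the stall as I understand it. It is not the actual
BPOfferService code: the class and method names (LockStallSketch,
processCommandUnderLock, tryHeartbeat) are hypothetical, and a plain
ReentrantReadWriteLock stands in for the real lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockStallSketch {
    // Hypothetical stand-in for the RW lock shared by both actors.
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    /** Actor A: holds the write lock for the whole duration of a slow command. */
    public void processCommandUnderLock(CountDownLatch started, CountDownLatch finish)
            throws InterruptedException {
        lock.writeLock().lock();
        try {
            started.countDown(); // signal that the lock is now held
            finish.await();      // simulates a command with high latency
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Actor B: needs the same lock to register or to process a heartbeat. */
    public boolean tryHeartbeat(long timeoutMs) throws InterruptedException {
        if (lock.writeLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
            lock.writeLock().unlock();
            return true;
        }
        return false; // timed out: the other actor still holds the lock
    }

    /** Returns {blocked while the command runs, succeeds once it has finished}. */
    public static boolean[] demo() {
        try {
            LockStallSketch s = new LockStallSketch();
            CountDownLatch started = new CountDownLatch(1);
            CountDownLatch finish = new CountDownLatch(1);
            Thread actorA = new Thread(() -> {
                try {
                    s.processCommandUnderLock(started, finish);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            actorA.start();
            started.await();                          // actor A holds the lock
            boolean duringCommand = s.tryHeartbeat(100);
            finish.countDown();                       // let the command complete
            actorA.join();
            boolean afterCommand = s.tryHeartbeat(100);
            return new boolean[] { !duringCommand, afterCommand };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("blocked during command: " + r[0]);
        System.out.println("succeeds after command: " + r[1]);
    }
}
```

In this sketch the second actor can make progress only after the first
one finishes its command, which is the stall described in the issue.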

If we remove the write lock held during command processing, then command 
processing would no longer be synchronized between the actors. I am not sure 
whether this opens up the possibility of anomalies in the datanode.
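
One possible shape for this, purely as a sketch under my own assumptions
(NarrowLockSketch, shouldProcess and the activeActor field are hypothetical
names, not the existing code), is to take the lock only to decide whether
the command should be handled and to run the command itself with no lock
held:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class NarrowLockSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // Hypothetical shared state guarded by the lock.
    private String activeActor = "nn1";

    /** Checks the shared state under the read lock and releases it before returning. */
    public boolean shouldProcess(String actor) {
        lock.readLock().lock();
        try {
            return actor.equals(activeActor);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Runs the command only for the active actor, without holding the lock. */
    public boolean processCommand(String actor, Runnable command) {
        if (!shouldProcess(actor)) {
            return false; // command from a non-active actor is skipped
        }
        // No lock is held here, so a slow command cannot stall the other actor.
        // The active actor may change between the check and this call, which
        // is the kind of anomaly this sketch does not address.
        command.run();
        return true;
    }

    /** Helper for the sketch: true if the calling thread holds the lock in any mode. */
    public boolean lockHeldByCurrentThread() {
        return lock.isWriteLockedByCurrentThread() || lock.getReadHoldCount() > 0;
    }
}
```

The open question is the window between the check and the execution of the
command, where the state read under the lock may already be stale.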

bq. The worst case scenario for processing commands while holding the lock is 
re-registration. The actor will loop, catching and logging exceptions, leaving 
the other actor blocked for a non-deterministic (possibly infinite) amount of 
time.

The re-registration process itself does not acquire the write lock to register 
the Datanode, right? (It needs the write lock to check that the new 
registration info is consistent with the storage.) 
Can you please elaborate on how the actor would go into a loop, catching and 
logging exceptions?

> BPOfferService lock may stall all service actors
> ------------------------------------------------
>
>                 Key: HDFS-12639
>                 URL: https://issues.apache.org/jira/browse/HDFS-12639
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.8.0
>            Reporter: Daryn Sharp
>            Assignee: Hanisha Koneru
>
> {{BPOfferService}} manages {{BPServiceActor}} instances for the active and 
> standby.  It uses a RW lock to primarily protect registration information 
> while determining the active/standby from heartbeats.
> Unfortunately the write lock is held during command processing.  If an actor 
> is experiencing high latency processing commands, the other actor will 
> neither be able to register (blocked in createRegistration, setNamespaceInfo, 
> verifyAndSetNamespaceInfo) nor process heartbeats (blocked in 
> updateActorStatesFromHeartbeat).
> The worst case scenario for processing commands while holding the lock is 
> re-registration.  The actor will loop, catching and logging exceptions, 
> leaving the other actor blocked for a non-deterministic (possibly infinite) 
> amount of time.
> The lock must not be held during command processing.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
