James Clampffer created HDFS-10931:
--------------------------------------

             Summary: libhdfs++: Fix object lifecycle issues in the BlockReader
                 Key: HDFS-10931
                 URL: https://issues.apache.org/jira/browse/HDFS-10931
             Project: Hadoop HDFS
          Issue Type: Sub-task
            Reporter: James Clampffer
            Assignee: James Clampffer
            Priority: Critical


The BlockReader can work itself into a state during AckRead (possibly other 
stages as well) where the pipeline posts a task for asio with a pointer back 
into itself, then promptly calls "delete this" without canceling the asio 
request.  When the asio task finishes it tries to acquire the lock at the 
address where the DataNodeConnection used to live, but the DN connection is no 
longer valid, so it's scribbling on some arbitrary bit of memory.  On some 
platforms the address that backed the old mutex state gets handed out to future 
mutexes, so the scribble corrupts their state and all the locks in that process 
start misbehaving.
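
To make the ordering problem concrete, here is a minimal sketch of it.  This is 
not the libhdfs++ code: TaskQueue is a hypothetical stand-in for asio's deferred 
execution, Connection is a hypothetical stand-in for the DataNodeConnection, and 
a weak_ptr replaces the raw pointer so the dead object can be detected instead 
of scribbled on.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical stand-in for asio: tasks run later, not when they are posted.
struct TaskQueue {
  std::queue<std::function<void()>> tasks;
  void post(std::function<void()> f) { tasks.push(std::move(f)); }
  void run() {
    while (!tasks.empty()) {
      auto f = std::move(tasks.front());
      tasks.pop();
      f();
    }
  }
};

// Hypothetical stand-in for the DataNodeConnection; only the mutex matters.
struct Connection {
  std::mutex state_lock;
};

// Returns true if the connection was still alive when the handler ran.
bool connection_alive_when_handler_runs() {
  TaskQueue queue;
  auto conn = std::make_shared<Connection>();
  std::weak_ptr<Connection> observer = conn;
  bool alive = true;

  // The handler only observes the connection; it does not own it.
  queue.post([observer, &alive]() {
    if (auto c = observer.lock()) {
      std::lock_guard<std::mutex> guard(c->state_lock);
      alive = true;
    } else {
      // With a raw pointer, this is where freed memory would get locked.
      alive = false;
    }
  });

  conn.reset();  // Plays the role of "delete this" before the task has run.
  assert(observer.expired());
  queue.run();
  return alive;
}
```

In this sketch connection_alive_when_handler_runs() returns false: the handler 
runs after the only owner is gone.  With a raw pointer in place of the weak_ptr 
the same sequence would lock a mutex in freed memory.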

This can be reproduced by using the patch from HDFS-8790 and adding more worker 
threads + a lot more reader threads.

I'm going to fix this in two parts:
1) Duct tape + superglue patch to make sure that all top level continuations in 
the block reader pipeline hold a shared_ptr to the DataNodeConnection.  Nested 
continuations also get a copy of the shared_ptr to make sure the connection is 
alive.  This at least keeps the connection alive so that it can keep returning 
asio::operation_aborted.

2) The continuation stuff needs a lot of work to make sure this type of bug 
doesn't keep popping up.  We've already fixed these issues in the RPC code.  
This will most likely need to be split into a few jiras.
- Continuation "framework" can be slimmed down quite a bit, perhaps even 
removed.  Near zero documentation + many implied contracts = constant bug 
chasing.
- Add comments to actually describe what's going on in the networking code.  
This bug took significantly longer than it should have to track down because I 
hadn't worked on the BlockReader in a while.
- No more "delete this".
- Flatten out nested continuations, e.g. the guts of BlockReaderImpl::AckRead.  
It's unclear why they were implemented like this in the first place, and there 
are no comments to indicate that this was intentional.
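
The idea behind part 1 can be sketched roughly as follows.  This is only an 
illustration under assumptions, not the actual patch: TaskQueue stands in for 
asio, and Connection, post_ack_read and run_keepalive_demo are hypothetical 
names.  The point is that the handler captures its own copy of the shared_ptr, 
so the connection outlives whoever posted the task.

```cpp
#include <functional>
#include <memory>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical stand-in for asio: tasks run later, not when they are posted.
struct TaskQueue {
  std::queue<std::function<void()>> tasks;
  void post(std::function<void()> f) { tasks.push(std::move(f)); }
  void run() {
    while (!tasks.empty()) {
      auto f = std::move(tasks.front());
      tasks.pop();
      f();
    }  // f, and any shared_ptr it captured, is released here
  }
};

// Hypothetical stand-in for the DataNodeConnection; records its destruction.
struct Connection {
  explicit Connection(bool &destroyed) : destroyed_(destroyed) {}
  ~Connection() { destroyed_ = true; }
  std::mutex state_lock;
  bool &destroyed_;
};

// Hypothetical pipeline stage: the handler holds a shared_ptr copy.
void post_ack_read(TaskQueue &queue, std::shared_ptr<Connection> conn,
                   bool &alive_in_handler) {
  queue.post([conn, &alive_in_handler]() {
    std::lock_guard<std::mutex> guard(conn->state_lock);
    alive_in_handler = !conn->destroyed_;
  });
}

struct Result {
  bool alive_in_handler;
  bool destroyed_before_run;
  bool destroyed_after_run;
};

Result run_keepalive_demo() {
  TaskQueue queue;
  bool destroyed = false;
  bool alive_in_handler = false;
  auto conn = std::make_shared<Connection>(destroyed);

  post_ack_read(queue, conn, alive_in_handler);
  conn.reset();  // The original owner goes away before the task runs.
  bool destroyed_before_run = destroyed;

  queue.run();   // The handler locks a live mutex, then drops the last ref.
  return {alive_in_handler, destroyed_before_run, destroyed};
}
```

Here the connection is still alive when the handler takes the lock and is only 
destroyed once the queue has released the handler.  The trade-off is that the 
object's lifetime is now tied to whatever tasks are outstanding, which is part 
of why the continuation code needs the cleanup described in part 2.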


