[ 
https://issues.apache.org/jira/browse/HDFS-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Clampffer updated HDFS-10931:
-----------------------------------
    Attachment: HDFS-10931.HDFS-8707.000.patch

Patch added for the first part of the problem.  It makes gratuitous use of 
shared_ptr to keep the DataNodeConnection alive.  The fundamental fixes to 
the architecture can be addressed later on.
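
For readers unfamiliar with the pattern, here is a minimal sketch of the
keep-alive idea.  It is not taken from the patch: FakeConnection,
FakeIoService and post_ack_read are hypothetical stand-ins for the
DataNodeConnection, the asio io_service and the AckRead continuation, used
only so the example compiles without asio.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical stand-in for DataNodeConnection; only the mutex matters here.
struct FakeConnection {
  std::mutex state_lock;
};

// Hypothetical stand-in for an asio io_service: a queue of deferred handlers.
struct FakeIoService {
  std::queue<std::function<void()>> pending;

  void post(std::function<void()> handler) { pending.push(std::move(handler)); }

  void run() {
    while (!pending.empty()) {
      std::function<void()> handler = std::move(pending.front());
      pending.pop();
      handler();
    }  // each handler, and the shared_ptr it captured, is destroyed here
  }
};

// The handler captures the shared_ptr by value, so the connection (and its
// mutex) cannot be destroyed before the handler has run.
void post_ack_read(FakeIoService& io, std::shared_ptr<FakeConnection> conn,
                   bool* handler_ran) {
  io.post([conn, handler_ran]() {
    std::lock_guard<std::mutex> guard(conn->state_lock);  // still valid memory
    *handler_ran = true;
  });
}

// Returns true if the connection outlived its last external owner until the
// queued handler ran, and was released once the handler was gone.
bool connection_survives_until_handler_runs() {
  FakeIoService io;
  bool handler_ran = false;
  std::weak_ptr<FakeConnection> observer;
  {
    auto conn = std::make_shared<FakeConnection>();
    observer = conn;
    post_ack_read(io, conn, &handler_ran);
  }  // the caller's reference goes away, as if the pipeline had been torn down

  const bool alive_before_run = !observer.expired();
  io.run();
  const bool released_after_run = observer.expired();
  return alive_before_run && handler_ran && released_after_run;
}
```

With a raw pointer captured instead of the shared_ptr, the handler would lock
a mutex in freed memory, which is the failure described in the issue below.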

> libhdfs++: Fix object lifecycle issues in the BlockReader
> ---------------------------------------------------------
>
>                 Key: HDFS-10931
>                 URL: https://issues.apache.org/jira/browse/HDFS-10931
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: James Clampffer
>            Assignee: James Clampffer
>            Priority: Critical
>         Attachments: HDFS-10931.HDFS-8707.000.patch
>
>
> The BlockReader can work itself into a state during AckRead (possibly other 
> stages as well) where the pipeline posts a task for asio with a pointer back 
> into itself, then promptly calls "delete this" without canceling the asio 
> request.  The asio task finishes and tries to acquire the lock at the address 
> where the DataNodeConnection used to live, but the DN connection is no 
> longer valid, so it's scribbling on some arbitrary bit of memory.  On some 
> platforms the underlying address used by the mutex state will be handed out 
> to future mutexes, so the scribble breaks that state and all the locks in 
> that process start misbehaving.
> This can be reproduced by using the patch from HDFS-8790 and adding more 
> worker threads + a lot more reader threads.
> I'm going to fix this in two parts:
> 1) Duct tape + superglue patch to make sure that all top level continuations 
> in the block reader pipeline hold a shared_ptr to the DataNodeConnection.  
> Nested continuations also get a copy of the shared_ptr to make sure the 
> connection is alive.  This at least keeps the connection alive so that it can 
> keep returning asio::operation_aborted.
> 2) The continuation stuff needs a lot of work to make sure this type of bug 
> doesn't keep popping up.  We've already fixed these issues in the RPC code.  
> This will most likely need to be split into a few jiras.
> - Continuation "framework" can be slimmed down quite a bit, perhaps even 
> removed.  Near zero documentation + many implied contracts = constant bug 
> chasing.
> - Add comments to actually describe what's going on in the networking code.  
> This bug took significantly longer than it should have to track down because 
> I hadn't worked on the BlockReader in a while.
> - No more "delete this".
> - Flatten out nested continuations, e.g. the guts of BlockReaderImpl::AckRead. 
>  It's unclear why they were implemented like this in the first place and 
> there are no comments to indicate that this was intentional.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
