[jira] [Commented] (HDFS-10931) libhdfs++: Fix object lifecycle issues in the BlockReader

Hudson (JIRA) Thu, 22 Mar 2018 14:58:53 -0700

    [ https://issues.apache.org/jira/browse/HDFS-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16410449#comment-16410449 ]


Hudson commented on HDFS-10931:
-------------------------------

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #13869 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/13869/])
HDFS-10931: libhdfs++: Fix object lifecycle issues in the BlockReader (james.clampffer: rev 2a42eeb66f7fbf3fb4fc434480a712c53cf0243a)
* (edit) hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs-tests/test_libhdfs_mini_stress.c
* (edit) hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/lib/reader/block_reader.cc


> libhdfs++: Fix object lifecycle issues in the BlockReader
> ---------------------------------------------------------
>
>                 Key: HDFS-10931
>                 URL: https://issues.apache.org/jira/browse/HDFS-10931
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: James Clampffer
>            Assignee: James Clampffer
>            Priority: Critical
>         Attachments: HDFS-10931.HDFS-8707.000.patch, 
> HDFS-10931.HDFS-8707.001.patch
>
>
> The BlockReader can work itself into a state during AckRead (possibly other 
> stages as well) where the pipeline posts a task for asio with a pointer back 
> into itself, then promptly calls "delete this" without canceling the asio 
> request.  The asio task finishes and tries to acquire the lock at the address 
> where the DataNodeConnection used to live - but the DN connection is no 
> longer valid so it's scribbling on some arbitrary bit of memory.  On some 
> platforms the underlying address used by the mutex state will be handed out 
> to future mutexes so the scribble breaks that state and all the locks in that 
> process start misbehaving.
> This can be reproduced by using the patch from HDFS-8790 and adding more 
> worker threads + a lot more reader threads.
> I'm going to fix this in two parts:
> 1) Duct tape + superglue patch to make sure that all top level continuations 
> in the block reader pipeline hold a shared_ptr to the DataNodeConnection.  
> Nested continuations also get a copy of the shared_ptr to make sure the 
> connection is alive.  This at least keeps the connection alive so that it can 
> keep returning asio::operation_aborted.
> 2) The continuation stuff needs a lot of work to make sure this type of bug 
> doesn't keep popping up.  We've already fixed these issues in the RPC code.  
> This will most likely need to be split into a few jiras.
> - Continuation "framework" can be slimmed down quite a bit, perhaps even 
> removed.  Near zero documentation + many implied contracts = constant bug 
> chasing.
> - Add comments to actually describe what's going on in the networking code.  
> This bug took significantly longer than it should have to track down because 
> I hadn't worked on the BlockReader in a while.
> - No more "delete this".
> - Flatten out nested continuations, e.g. the guts of BlockReaderImpl::AckRead. 
> It's unclear why they were implemented like this in the first place, and 
> there are no comments to indicate that this was intentional.
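
The first part of the fix described above (each continuation holds a shared_ptr to the connection, so the connection outlives any pending handler) can be sketched roughly as follows. This is a minimal illustration and not the code from the patch: TaskQueue, Connection, AsyncRead, Cancel, and RunDemo are hypothetical stand-ins, with TaskQueue playing the role of the asio io_service and Connection the role of DataNodeConnection. No real asio or libhdfs++ API is used.

```cpp
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical stand-in for an asio io_service: a queue of deferred tasks.
class TaskQueue {
 public:
  void Post(std::function<void()> task) { tasks_.push_back(std::move(task)); }

  // Runs every queued task, destroying each one (and its captures) afterwards.
  int RunAll() {
    int ran = 0;
    while (!tasks_.empty()) {
      std::function<void()> task = std::move(tasks_.front());
      tasks_.pop_front();
      task();
      ++ran;
    }  // 'task' goes out of scope here, releasing anything it captured
    return ran;
  }

 private:
  std::deque<std::function<void()>> tasks_;
};

// Hypothetical stand-in for DataNodeConnection.
class Connection : public std::enable_shared_from_this<Connection> {
 public:
  explicit Connection(bool* destroyed) : destroyed_(destroyed) {}
  ~Connection() { *destroyed_ = true; }

  // Posts a deferred handler. The lambda captures 'self', a shared_ptr to
  // this object, so the mutex it locks later is guaranteed to still exist.
  void AsyncRead(TaskQueue& queue, std::function<void(bool aborted)> handler) {
    std::shared_ptr<Connection> self = shared_from_this();
    queue.Post([self, handler]() {
      std::lock_guard<std::mutex> lock(self->mutex_);
      handler(self->cancelled_);
    });
  }

  void Cancel() {
    std::lock_guard<std::mutex> lock(mutex_);
    cancelled_ = true;
  }

 private:
  std::mutex mutex_;
  bool cancelled_ = false;
  bool* destroyed_;  // test hook: set to true by the destructor
};

struct DemoResult {
  bool alive_while_pending;
  bool handler_saw_abort;
  bool destroyed_after_run;
};

// The owner cancels and drops its reference while a handler is still queued.
DemoResult RunDemo() {
  bool destroyed = false;
  bool saw_abort = false;
  TaskQueue queue;

  auto conn = std::make_shared<Connection>(&destroyed);
  conn->AsyncRead(queue, [&saw_abort](bool aborted) { saw_abort = aborted; });
  conn->Cancel();
  conn.reset();  // owner lets go; the queued handler still holds a reference

  DemoResult result;
  result.alive_while_pending = !destroyed;
  queue.RunAll();
  result.handler_saw_abort = saw_abort;
  result.destroyed_after_run = destroyed;
  return result;
}
```

In this sketch the connection is destroyed only after the last queued handler has run and been released, which is the property the "delete this" pattern lacked. The handler observes the cancelled state instead of touching freed memory, loosely mirroring the connection continuing to return asio::operation_aborted.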



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

