[
https://issues.apache.org/jira/browse/HDFS-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16410449#comment-16410449
]
Hudson commented on HDFS-10931:
-------------------------------
SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #13869 (See
[https://builds.apache.org/job/Hadoop-trunk-Commit/13869/])
HDFS-10931: libhdfs++: Fix object lifecycle issues in the BlockReader
(james.clampffer: rev 2a42eeb66f7fbf3fb4fc434480a712c53cf0243a)
* (edit)
hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs-tests/test_libhdfs_mini_stress.c
* (edit)
hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/lib/reader/block_reader.cc
> libhdfs++: Fix object lifecycle issues in the BlockReader
> ---------------------------------------------------------
>
> Key: HDFS-10931
> URL: https://issues.apache.org/jira/browse/HDFS-10931
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: hdfs-client
> Reporter: James Clampffer
> Assignee: James Clampffer
> Priority: Critical
> Attachments: HDFS-10931.HDFS-8707.000.patch,
> HDFS-10931.HDFS-8707.001.patch
>
>
> The BlockReader can work itself into a state during AckRead (possibly other
> stages as well) where the pipeline posts a task for asio with a pointer back
> into itself, then promptly calls "delete this" without canceling the asio
> request. The asio task finishes and tries to acquire the lock at the address
> where the DataNodeConnection used to live - but the DN connection is no
> longer valid so it's scribbling on some arbitrary bit of memory. On some
> platforms the underlying address used by the mutex state will be handed out
> to future mutexes so the scribble breaks that state and all the locks in that
> process start misbehaving.
> This can be reproduced by using the patch from HDFS-8790 and adding more
> worker threads + a lot more reader threads.
> I'm going to fix this in two parts:
> 1) Duct tape + superglue patch to make sure that all top level continuations
> in the block reader pipeline hold a shared_ptr to the DataNodeConnection.
> Nested continuations also get a copy of the shared_ptr to make sure the
> connection is alive. This at least keeps the connection alive so that it can
> keep returning asio::operation_aborted.
> 2) The continuation stuff needs a lot of work to make sure this type of bug
> doesn't keep popping up. We've already fixed these issues in the RPC code.
> This will most likely need to be split into a few jiras.
> - Continuation "framework" can be slimmed down quite a bit, perhaps even
> removed. Near zero documentation + many implied contracts = constant bug
> chasing.
> - Add comments to actually describe what's going on in the networking code.
> This bug took significantly longer than it should have to track down because
> I hadn't worked on the BlockReader in a while.
> - No more "delete this".
> - Flatten out nested continuations e.g. the guts of BlockReaderImpl::AckRead.
> It's unclear why they were implemented like this in the first place and
> there are no comments to indicate that this was intentional.
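
The keep-alive idea in part 1 of the description can be sketched roughly as
follows. This is only an illustration under stated assumptions, not the code
from the patch: Connection, TaskQueue, PostAckRead and RunExample are
hypothetical stand-ins (for the DataNodeConnection, the asio io_service, and a
pipeline stage), and the string "operation_aborted" merely mimics the asio
error the description mentions. The point is that the posted handler captures
the shared_ptr by value, so the mutex it locks is still valid even after the
original owner has gone away.

```cpp
#include <functional>
#include <memory>
#include <mutex>
#include <queue>
#include <string>
#include <utility>

// Hypothetical stand-in for the DataNodeConnection: it owns the mutex that
// completion handlers lock when they eventually run.
struct Connection {
  std::mutex state_lock;
  bool cancelled = false;

  void Cancel() {
    std::lock_guard<std::mutex> guard(state_lock);
    cancelled = true;
  }
};

// Hypothetical stand-in for the asio io_service: a queue of deferred tasks.
class TaskQueue {
 public:
  void Post(std::function<void()> task) { tasks_.push(std::move(task)); }

  void RunAll() {
    while (!tasks_.empty()) {
      std::function<void()> task = std::move(tasks_.front());
      tasks_.pop();
      task();
    }  // each task (and whatever it captured) is destroyed after it runs
  }

 private:
  std::queue<std::function<void()>> tasks_;
};

// Posts a handler that holds its own shared_ptr copy. Capturing a raw
// pointer here instead would reproduce the use-after-free described above.
void PostAckRead(TaskQueue& queue, std::shared_ptr<Connection> conn,
                 std::string* status) {
  queue.Post([conn, status]() {
    std::lock_guard<std::mutex> guard(conn->state_lock);
    *status = conn->cancelled ? "operation_aborted" : "ok";
  });
}

struct Outcome {
  bool alive_while_pending;
  std::string handler_status;
  bool released_after_run;
};

Outcome RunExample() {
  TaskQueue queue;
  Outcome outcome{};
  std::weak_ptr<Connection> observer;
  {
    auto conn = std::make_shared<Connection>();
    observer = conn;
    PostAckRead(queue, conn, &outcome.handler_status);
    conn->Cancel();
  }  // the owner drops its reference, loosely analogous to "delete this"

  // The pending handler still holds a reference, so the connection lives.
  outcome.alive_while_pending = !observer.expired();
  queue.RunAll();
  // Once the handler has run and been destroyed, the connection is freed.
  outcome.released_after_run = observer.expired();
  return outcome;
}
```

Calling RunExample() in this sketch shows the handler running against a live
connection and reporting the cancelled state, with the connection released
only after the last handler holding it has been destroyed.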