[ https://issues.apache.org/jira/browse/HDFS-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Clampffer updated HDFS-10931:
-----------------------------------
    Attachment: HDFS-10931.HDFS-8707.000.patch

Patch added for the first part of the problem. Gratuitous use of shared_ptr to keep the DataNodeConnection alive. The fundamental fixes to the architecture can be addressed later on.

> libhdfs++: Fix object lifecycle issues in the BlockReader
> ---------------------------------------------------------
>
>                 Key: HDFS-10931
>                 URL: https://issues.apache.org/jira/browse/HDFS-10931
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: James Clampffer
>            Assignee: James Clampffer
>            Priority: Critical
>         Attachments: HDFS-10931.HDFS-8707.000.patch
>
>
> The BlockReader can work itself into a state during AckRead (possibly other
> stages as well) where the pipeline posts a task for asio with a pointer back
> into itself, then promptly calls "delete this" without canceling the asio
> request. The asio task finishes and tries to acquire the lock at the address
> where the DataNodeConnection used to live, but the DN connection is no
> longer valid, so it's scribbling on some arbitrary bit of memory. On some
> platforms the underlying address used by the mutex state will be handed out
> to future mutexes, so the scribble breaks that state and all the locks in that
> process start misbehaving.
> This can be reproduced by using the patch from HDFS-8790 and adding more
> worker threads + a lot more reader threads.
> I'm going to fix this in two parts:
> 1) Duct tape + superglue patch to make sure that all top-level continuations
> in the block reader pipeline hold a shared_ptr to the DataNodeConnection.
> Nested continuations also get a copy of the shared_ptr to make sure the
> connection is alive. This at least keeps the connection alive so that it can
> keep returning asio::operation_aborted.
> 2) The continuation stuff needs a lot of work to make sure this type of bug
> doesn't keep popping up. We've already fixed these issues in the RPC code.
> This will most likely need to be split into a few jiras.
> - Continuation "framework" can be slimmed down quite a bit, perhaps even
> removed. Near zero documentation + many implied contracts = constant bug
> chasing.
> - Add comments to actually describe what's going on in the networking code.
> This bug took significantly longer than it should have to track down because
> I hadn't worked on the BlockReader in a while.
> - No more "delete this".
> - Flatten out nested continuations, e.g. the guts of BlockReaderImpl::AckRead.
> It's unclear why they were implemented like this in the first place, and
> there are no comments to indicate that this was intentional.
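
The shared_ptr keep-alive described in part 1 can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code from the attached patch: TaskQueue and Connection are hypothetical stand-ins for the asio event loop and the DataNodeConnection, and RunKeepAliveDemo and the callback body are made up for the demo. The point it tries to show is that a queued callback capturing a shared_ptr (rather than a raw pointer) keeps the object and its mutex valid until the callback has run.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical stand-in for the asio event loop: callbacks are queued
// now and executed later, after the caller may have gone away.
class TaskQueue {
 public:
  void Post(std::function<void()> task) { tasks_.push(std::move(task)); }

  void RunAll() {
    while (!tasks_.empty()) {
      auto task = std::move(tasks_.front());
      tasks_.pop();
      task();
    }  // each task (and anything it captured) is destroyed here
  }

 private:
  std::queue<std::function<void()>> tasks_;
};

// Hypothetical stand-in for DataNodeConnection.
class Connection : public std::enable_shared_from_this<Connection> {
 public:
  explicit Connection(int* completed) : completed_(completed) {}

  // Posts a deferred callback. The lambda captures a shared_ptr to this
  // object, so the connection and its mutex stay alive until the
  // callback has finished, even if every other owner lets go first.
  void AsyncRead(TaskQueue& queue) {
    auto self = shared_from_this();
    queue.Post([self]() {
      std::lock_guard<std::mutex> lock(self->mutex_);
      ++(*self->completed_);
    });
  }

 private:
  std::mutex mutex_;
  int* completed_;
};

// Returns how many callbacks ran after the owner dropped its reference.
int RunKeepAliveDemo() {
  TaskQueue queue;
  int completed = 0;
  std::weak_ptr<Connection> observer;
  {
    auto conn = std::make_shared<Connection>(&completed);
    observer = conn;
    conn->AsyncRead(queue);
  }  // the owner's reference is gone; the queued task holds the last one

  assert(!observer.expired());  // still alive because of the capture
  queue.RunAll();               // callback locks a mutex that still exists
  assert(observer.expired());   // released once the task is destroyed
  return completed;
}
```

Calling RunKeepAliveDemo() from a main() should run the single queued callback against a live object. Had the lambda captured a raw `this` instead, the callback would touch freed memory once the owner released the connection, which is the failure mode the issue describes.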