[ https://issues.apache.org/jira/browse/HDFS-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16410420#comment-16410420 ]
Hudson commented on HDFS-10247:
-------------------------------
SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #13869 (See
[https://builds.apache.org/job/Hadoop-trunk-Commit/13869/])
HDFS-10247: libhdfs++: Datanode protocol version mismatch fix.
(james.clampffer: rev 60c3437267b864a01d27783535906c6a7e81058e)
* (edit) hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/lib/reader/block_reader.cc
* (edit) hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/lib/common/continuation/asio.h
* (edit) hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/lib/common/continuation/protobuf.h
* (edit) hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/lib/common/util.h
* (edit) hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/lib/common/util.cc
* (edit) hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/lib/reader/datatransfer_impl.h
> libhdfs++: Datanode protocol version mismatch
> ---------------------------------------------
>
> Key: HDFS-10247
> URL: https://issues.apache.org/jira/browse/HDFS-10247
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: hdfs-client
> Reporter: James Clampffer
> Assignee: James Clampffer
> Priority: Major
> Attachments: HDFS-10247.HDFS-8707.000.patch,
> HDFS-10247.HDFS-8707.001.patch, HDFS-10247.HDFS-8707.002.patch
>
>
> Occasionally "Version Mismatch (Expected: 28, Received: 22794 )" shows up in
> the logs. This rarely happens with fewer than 500 concurrent reads, but it
> starts happening often enough to be an issue at 1000 concurrent reads.
> I've seen 3 distinct numbers: 23050 (most common), 22538, and 22794. If you
> break these shorts into their two bytes you get:
> {code}
> 23050 -> [90,10]
> 22794 -> [89,10]
> 22538 -> [88,10]
> {code}
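> A quick way to reproduce that split (standalone sketch, not libhdfs++ code):
> treat each value as a big-endian short and peel off the two bytes.
> {code}
> // Sketch only: decompose the observed "versions" into the two bytes the DN
> // would have read off the wire.
> #include <cstdio>
>
> int main() {
>   const unsigned observed[] = {23050, 22794, 22538};
>   for (unsigned v : observed) {
>     // The protocol version is read as a big-endian short, high byte first.
>     std::printf("%u -> [%u,%u]\n", v, v >> 8, v & 0xFF);
>   }
>   return 0;  // prints 23050 -> [90,10], 22794 -> [89,10], 22538 -> [88,10]
> }
> {code}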
> Interestingly enough, if we dump the buffers holding protobuf messages just
> before they hit the wire, we see things like the following, with the first
> two bytes being 90,10:
> {code}
> buffer
> ={90,10,82,10,64,10,52,10,37,66,80,45,49,51,56,49,48,51,51,57,57,49,45,49,50,55,46,48,46,48,46,49,45,49,52,53,57,53,50,53,54,49,53,55,50,53,16,-127,-128,-128,-128,4,24,-23,7,32,-128,-128,64,18,8,10,0,18,0,26,0,34,0,18,14,108,105,98,104,100,102,115,43,43,95,75,67,43,49,16,0,24,23,32,1}
> {code}
> The first 3 bytes the DN expects for an unsecured read block request are:
> {code}
> {0,28,81} // [0,28] -> the protocol version as a short, 81 is the read block opcode
> {code}
> {code}
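> For context, here is a rough sketch of the framing the DN expects before the
> protobuf payload (constant and function names here are illustrative, not the
> actual libhdfs++ API):
> {code}
> // Sketch only: the DN expects a 2-byte big-endian protocol version and a
> // 1-byte opcode ahead of the delimited OpReadBlockProto bytes. If those 3
> // bytes were missing, the DN would read the start of the protobuf as the
> // version, which matches the 230xx/225xx numbers above.
> #include <cstdint>
> #include <string>
>
> static const uint16_t kDataTransferVersion = 28;  // expected by the DN
> static const uint8_t kReadBlockOp = 81;           // read block opcode
>
> std::string FrameReadBlockRequest(const std::string &delimited_proto) {
>   std::string wire;
>   wire.push_back(static_cast<char>(kDataTransferVersion >> 8));    // 0
>   wire.push_back(static_cast<char>(kDataTransferVersion & 0xFF));  // 28
>   wire.push_back(static_cast<char>(kReadBlockOp));                 // 81
>   wire.append(delimited_proto);  // e.g. the {90,10,82,...} bytes above
>   return wire;
> }
> {code}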
> This seems like either connections are getting swapped between readers, or
> the header isn't being sent for some reason while the protobuf message is.
> I've ruled out memory stomps on the header data (see HDFS-10241) by putting
> the 3-byte header in its own static buffer that all requests use.
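> Roughly what that diagnostic looks like (illustrative sketch, not the actual
> patch):
> {code}
> // Illustration only: keep the 3-byte read-block header in one shared,
> // read-only buffer. Every request sends from this same buffer, so if the
> // DN still reports a version mismatch, the header bytes were not being
> // stomped in client memory.
> #include <array>
> #include <cstddef>
> #include <cstdint>
>
> static const std::array<uint8_t, 3> kReadBlockHeader = {{0, 28, 81}};
>
> const uint8_t *ReadBlockHeaderData() { return kReadBlockHeader.data(); }
> std::size_t ReadBlockHeaderSize() { return kReadBlockHeader.size(); }
> {code}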
> Some notes:
> - The mismatched number stays the same for the duration of a stress test.
> - The mismatch is distributed fairly evenly throughout the logs.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)