James Clampffer created HDFS-10247:
--------------------------------------

             Summary: libhdfs++: Datanode protocol version mismatch
                 Key: HDFS-10247
                 URL: https://issues.apache.org/jira/browse/HDFS-10247
             Project: Hadoop HDFS
          Issue Type: Sub-task
            Reporter: James Clampffer
            Assignee: James Clampffer


Occasionally "Version Mismatch (Expected: 28, Received: 22794 )" shows up in 
the logs.  This rarely happens with fewer than 500 concurrent reads, but it 
becomes frequent enough to be a problem at 1000 concurrent reads.

I've seen 3 distinct numbers: 23050 (most common), 22538, and 22794.  If you 
break these shorts into bytes (high byte first) you get:
{code}
23050 -> [90,10]
22794 -> [89,10]
22538 -> [88,10]
{code}
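
The split above can be reproduced with a short sketch. It assumes the logged value is an unsigned 16-bit integer read in big-endian (network) order; `split_short` is an illustrative name, not a libhdfs++ function.

```python
import struct

def split_short(value):
    """Return the [high, low] bytes of a 16-bit value in big-endian order."""
    # ">H" = big-endian unsigned short. Big-endian is an assumption about how
    # the version field is read on the DN side.
    return list(struct.pack(">H", value))

for received in (23050, 22794, 22538):
    print(received, "->", split_short(received))
```

In all three cases the low byte is 10 and only the high byte varies.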

Interestingly, dumping the buffers that hold protobuf messages just before 
they hit the wire shows contents like the following, where the first two 
bytes are 90,10:
{code}
buffer = {90,10,82,10,64,10,52,10,37,66,80,45,49,51,56,49,48,51,51,57,57,49,45,49,50,55,46,48,46,48,46,49,45,49,52,53,57,53,50,53,54,49,53,55,50,53,16,-127,-128,-128,-128,4,24,-23,7,32,-128,-128,64,18,8,10,0,18,0,26,0,34,0,18,14,108,105,98,104,100,102,115,43,43,95,75,67,43,49,16,0,24,23,32,1}
{code}
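
One possible reading of those first two bytes is sketched below. It assumes the buffer holds a single varint-length-delimited protobuf message, so the first byte would be the length prefix and the second the tag of the first field. The helper names are illustrative, and nothing here relies on the actual field layout of the request message.

```python
# Bytes copied verbatim from the dump above (signed, Java-style values).
DUMP = [90,10,82,10,64,10,52,10,37,66,80,45,49,51,56,49,48,51,51,57,57,49,45,
        49,50,55,46,48,46,48,46,49,45,49,52,53,57,53,50,53,54,49,53,55,50,53,
        16,-127,-128,-128,-128,4,24,-23,7,32,-128,-128,64,18,8,10,0,18,0,26,0,
        34,0,18,14,108,105,98,104,100,102,115,43,43,95,75,67,43,49,16,0,24,23,
        32,1]

def to_unsigned(signed_bytes):
    """Map signed byte values (-128..127) to an unsigned bytes object."""
    return bytes(b & 0xFF for b in signed_bytes)

def read_varint(data, pos=0):
    """Decode a protobuf base-128 varint; return (value, next_position)."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7

buf = to_unsigned(DUMP)
length, pos = read_varint(buf)      # assumed length prefix
tag, _ = read_varint(buf, pos)      # tag of the first field
field_number, wire_type = tag >> 3, tag & 0x07

print("length prefix:", length, "bytes after prefix:", len(buf) - pos)
print("first field number:", field_number, "wire type:", wire_type)
```

If the dump is complete, the prefix equals the number of bytes that follow it, and the second byte decodes as field 1 with the length-delimited wire type. Under that reading, 89 and 88 would simply be prefixes of messages one or two bytes shorter.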

The first 3 bytes the DN expects for an unsecured read block request are:
{code}
{0,28,81} //[0, 28]->a short for protocol, 81 is read block opcode
{code}
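
A minimal sketch of how those three bytes would be interpreted, assuming the DN reads a big-endian 16-bit version followed by a one-byte opcode. `parse_header`, `check_version`, and the constant names are hypothetical, not DN or libhdfs++ code. The second call shows what such a reader would report if the stream began with the protobuf bytes 90,10 instead of the header.

```python
import struct

EXPECTED_VERSION = 28    # taken from the log message
READ_BLOCK_OPCODE = 81   # taken from the header bytes above

def parse_header(stream):
    """Return (version, opcode) from the first three bytes of a request."""
    # ">HB" = big-endian unsigned short, then one unsigned byte (assumed layout).
    return struct.unpack(">HB", stream[:3])

def check_version(stream):
    """Mimic the version check; the message format mirrors the logged one."""
    version, _ = parse_header(stream)
    if version != EXPECTED_VERSION:
        return "Version Mismatch (Expected: %d, Received: %d )" % (
            EXPECTED_VERSION, version)
    return "ok"

payload_start = bytes([90, 10, 82])             # first bytes of the dumped buffer
good = bytes([0, 28, READ_BLOCK_OPCODE]) + payload_start
headerless = payload_start                      # header missing from the stream

print(check_version(good))
print(check_version(headerless))
```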

It looks like either connections are getting swapped between readers, or the
header isn't being sent for some reason while the protobuf message is.

I've ruled out memory stomps on the header data (see HDFS-10241) by sticking 
the 3-byte header in its own static buffer that all requests use.

Some notes:
-The mismatched number stays the same for the duration of a stress test.
-The mismatches are distributed fairly evenly throughout the logs.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
