[ https://issues.apache.org/jira/browse/HDFS-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-872:
-----------------------------

    Attachment: hdfs-872.txt

Attached is a patch against the current branch-0.20 which resolves the protocol 
incompatibility introduced by the HDFS-101/HDFS-793 pair. Since this is tricky 
code, I'll try to summarize the patch in detail:

In HDFS-793, PipelineAck's wire format gained a new element: the number of 
status replies to follow. This is the central incompatibility. So, in this 
patch, I removed that field and reset the version number back to the original 
14 from the old branch-0.20. To know how many status replies to read, 
PipelineAck now takes the downstream pipeline depth as a constructor 
parameter. This is used only for reading and is otherwise -1 (it is an error 
to call readFields if it has not been set).

Since the number of replies in a pipeline ack is no longer dynamic, I removed 
the getNumOfReplies call as well.

When reading an ack, I check for the HEARTBEAT message and in that case don't 
read any replies. Otherwise I expect a reply from each downstream datanode. 
*For review:* should readFields handle the case of a sequence number equal to 
-2? As best I can tell, the current code never sends such a sequence number, 
and if it does, that is an error. It may make sense to check for this and 
throw an IOException for any negative seqno that is not HEARTBEAT_SEQNO. 
Opinions appreciated.
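
To make the above concrete, here is a minimal sketch of the reworked ack 
reading path, with illustrative names and values (not the exact patch; the 
heartbeat constant and field names may differ):

{code}
import java.io.DataInput;
import java.io.IOException;

// Minimal sketch, not the exact patch: the reply count comes from the
// constructor (the downstream pipeline depth) instead of from the wire.
public class PipelineAckSketch {
  static final long HEARTBEAT_SEQNO = -1L;   // illustrative value only

  private long seqno;
  private short[] replies = new short[0];
  private final int numOfReplies;            // -1 unless constructed for reading

  public PipelineAckSketch(int numOfReplies) {
    this.numOfReplies = numOfReplies;
  }

  public void readFields(DataInput in) throws IOException {
    if (numOfReplies < 0) {
      throw new IOException("ack was not constructed for reading");
    }
    seqno = in.readLong();
    if (seqno == HEARTBEAT_SEQNO) {
      replies = new short[0];                // heartbeat: no replies follow
      return;
    }
    if (seqno < 0) {
      // proposed guard from the review question above: any other negative
      // seqno means the reads are misaligned
      throw new IOException("Unexpected negative seqno " + seqno);
    }
    replies = new short[numOfReplies];       // one status per downstream DN
    for (int i = 0; i < numOfReplies; i++) {
      replies[i] = in.readShort();
    }
  }
}
{code}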

In DFSClient I added a DEBUG-level printout of the contents of the pipeline. 
This was useful during testing to confirm that I killed each node in the 
pipeline in the intended order.
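
The debug output is just a log line along these lines (an illustrative 
sketch, not the exact patch; the helper name is made up):

{code}
// Illustrative sketch only: log the write pipeline at DEBUG level so a test
// run can confirm which datanode is being killed.
private static void logPipeline(org.apache.commons.logging.Log log,
    org.apache.hadoop.hdfs.protocol.DatanodeInfo[] nodes) {
  if (!log.isDebugEnabled()) {
    return;
  }
  StringBuilder pipeline = new StringBuilder("pipeline =");
  for (org.apache.hadoop.hdfs.protocol.DatanodeInfo node : nodes) {
    pipeline.append(' ').append(node.getName());
  }
  log.debug(pipeline.toString());
}
{code}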

In BlockReceiver, I added back the "continue" during HEARTBEAT processing. I 
believe this was an omission in the earlier patch: as best I can tell, without 
the continue, it currently sends a spurious "seqno=-2" ack after each 
heartbeat. With the continue, it loops back around correctly to wait for the 
next ack.

*For review*: I put a TODO for the case where BlockReceiver receives a seqno 
of -2. I currently believe that any negative sequence number that is not 
HEART_BEAT is an error and should throw an IOException (e.g. our reads got 
misaligned).
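
Roughly, the responder's ack-forwarding loop then behaves like the sketch 
below (illustrative names and structure, not the exact patch):

{code}
// Sketch with illustrative names: a heartbeat ack is relayed upstream and the
// loop waits for the next ack; any other negative seqno is treated as a
// protocol error instead of being forwarded as a spurious seqno=-2.
void forwardAcks(java.io.DataInputStream replyIn,
                 java.io.DataOutputStream replyOut,
                 long heartbeatSeqno) throws java.io.IOException {
  while (true) {
    long seqno = replyIn.readLong();         // ack from the downstream DN
    if (seqno == heartbeatSeqno) {
      replyOut.writeLong(seqno);             // relay the heartbeat upstream
      replyOut.flush();
      continue;                              // the restored "continue": wait for the next ack
    }
    if (seqno < 0) {
      // TODO from the patch: a negative non-heartbeat seqno means our reads
      // are misaligned; throwing an IOException is the proposed behavior.
      throw new java.io.IOException("Unexpected seqno " + seqno + " from downstream");
    }
    // ... normal case (elided): match the pending packet, append this
    // datanode's own status, and forward the combined ack upstream ...
  }
}
{code}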

When constructing the ack message for a failed mirror, since every ack must 
have the same number of replies, I send SUCCESS followed by N ERRORs, where N 
is the number of downstream targets. The client's behavior is to eject the 
first ERROR node, so the presence of ERROR statuses further downstream is 
unimportant; in truth they are semantically UNKNOWN, but no such status code 
exists. *For review*: HDFS-793 reversed the order of the loop at 
DFSClient.java:2431 to locate the _last_ DN with ERROR status. I had to 
reverse this back to the original loop order for this patch, since the replies 
look like SUCCESS, ERROR, ERROR in the case where DN 2 dies.
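
Both sides of that, roughly (an illustrative sketch, not the exact diff; the 
method names are made up and the constants merely stand in for 
DataTransferProtocol's status codes):

{code}
// Illustrative sketch, not the exact diff.
public class PipelineErrorHandlingSketch {
  static final short OP_STATUS_SUCCESS = 0;  // stand-ins for the real codes
  static final short OP_STATUS_ERROR = 1;

  // Datanode side: the mirror failed, but every ack must carry the same
  // number of replies, so send SUCCESS for this node followed by ERROR for
  // each of the N downstream targets (semantically "unknown").
  static short[] buildFailedMirrorReplies(int numDownstreamTargets) {
    short[] replies = new short[numDownstreamTargets + 1];
    replies[0] = OP_STATUS_SUCCESS;
    for (int i = 1; i < replies.length; i++) {
      replies[i] = OP_STATUS_ERROR;
    }
    return replies;
  }

  // Client side: with the original loop order restored, eject the FIRST node
  // that reports a non-SUCCESS status, since everything after it is ERROR too.
  static int findFirstErrorNode(short[] replies) {
    for (int i = 0; i < replies.length; i++) {
      if (replies[i] != OP_STATUS_SUCCESS) {
        return i;        // index of the datanode to drop from the pipeline
      }
    }
    return -1;           // no error in this ack
  }
}
{code}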

In terms of testing, I performed the following:

- Started up a 3-node distributed cluster using patched servers.
- With an unpatched client, began uploading a file. Killed each node in the 
pipeline in turn (first, second, last) and ensured that the correct datanode 
was ejected.
- With a patched client and patched servers, ran the same test.
- With a patched client and unpatched servers, ensured that file uploads work 
properly. I did not test killing the unpatched server nodes here; I can do so 
if necessary, but I was using a shared cluster for this test.

In all cases, the file upload lasted more than 30 seconds, so heartbeats were 
tested.

> DFSClient 0.20.1 is incompatible with HDFS 0.20.2
> -------------------------------------------------
>
>                 Key: HDFS-872
>                 URL: https://issues.apache.org/jira/browse/HDFS-872
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 0.20.1, 0.20.2
>            Reporter: Bassam Tabbara
>             Fix For: 0.20.2
>
>         Attachments: hdfs-872.txt
>
>
> After upgrading to the latest HDFS 0.20.2 (r896310 from 
> /branches/branch-0.20), old DFS clients (0.20.1) no longer seem to work. 
> HBase uses the 0.20.1 hadoop core jars, and the HBase master will no longer 
> start up. Here is the exception from the HBase master log:
> {code}
> 2010-01-06 09:59:46,762 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read: 
> java.io.IOException: Could not obtain block: blk_3380512596555557728_1002 
> file=/hbase/hbase.version
>         at 
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1788)
>         at 
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1616)
>         at 
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
>         at 
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1673)
>         at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:320)
>         at java.io.DataInputStream.readUTF(DataInputStream.java:572)
>         at org.apache.hadoop.hbase.util.FSUtils.getVersion(FSUtils.java:189)
>         at org.apache.hadoop.hbase.util.FSUtils.checkVersion(FSUtils.java:208)
>         at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:208)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
>         at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>         at org.apache.hadoop.hbase.master.HMaster.doMain(HMaster.java:1241)
>         at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1282)
> 2010-01-06 09:59:46,763 FATAL org.apache.hadoop.hbase.master.HMaster: Not 
> starting HMaster because:
> java.io.IOException: Could not obtain block: blk_3380512596555557728_1002 
> file=/hbase/hbase.version
>         at 
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1788)
>         at 
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1616)
>         at 
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
>         at 
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1673)
>         at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:320)
>         at java.io.DataInputStream.readUTF(DataInputStream.java:572)
>         at org.apache.hadoop.hbase.util.FSUtils.getVersion(FSUtils.java:189)
>         at org.apache.hadoop.hbase.util.FSUtils.checkVersion(FSUtils.java:208)
>         at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:208)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
>         at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>         at org.apache.hadoop.hbase.master.HMaster.doMain(HMaster.java:1241)
>         at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1282)
> {code}
> If I switch the hadoop jars in the hbase/lib directory to the 0.20.2 version, 
> it works well, which is what led me to open this bug here and not in the 
> HBASE project.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
