[
https://issues.apache.org/jira/browse/HDFS-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Todd Lipcon updated HDFS-872:
-----------------------------
Attachment: hdfs-872.txt
Attached is a patch against the current branch-0.20 which resolves the protocol
incompatibility introduced by the HDFS-101/HDFS-793 pair. Since this is tricky code,
I'll try to summarize the patch in detail:
In HDFS-793, PipelineAck's wire format gained a new element: the number of status
replies to follow. This is the central incompatibility. So, in this patch, I removed
that field and reset the version number back to the original 14 from old branch-0.20.
To know how many status replies to read, PipelineAck now takes the downstream pipeline
depth as a constructor parameter. This parameter is used only for reading, and is
otherwise -1 (it is an error to call readFields if it has not been set).
Since the number of replies in a pipeline ack is no longer dynamic, I removed
the getNumOfReplies call as well.
When reading an ack, I check for the HEARTBEAT message, and in that case don't
read any replies. Otherwise I expect a reply from each downstream datanode.
*For review:* should readFields handle the case of a sequence number equal to
-2? Best I can tell, the current code never sends such a sequence number, and if
it ever did, that would be an error. It may make sense to check for this and throw
an IOException for any negative seqno that is not HEARTBEAT_SEQNO.
Opinions appreciated.
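To make the read path concrete, here is a rough sketch of the shape I'm describing.
This is illustrative only, not lines copied from the patch: the field names, the
HEART_BEAT_SEQNO value, and the error check are my approximations of the branch-0.20 code.
{code}
// Sketch only: approximates the patched PipelineAck, not the literal diff.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class PipelineAck {
  // Heartbeat sentinel seqno; the real constant lives in DataTransferProtocol (assumed value).
  public static final long HEART_BEAT_SEQNO = -1L;

  private long seqno;
  private short[] replies;          // one OP_STATUS_* code per downstream datanode
  private final int numOfReplies;   // downstream pipeline depth; used only for reading

  // Write-side constructor: the caller supplies the fixed set of replies.
  public PipelineAck(long seqno, short[] replies) {
    this.seqno = seqno;
    this.replies = replies;
    this.numOfReplies = -1;         // never consulted when writing
  }

  // Read-side constructor: the caller must say how many replies to expect.
  public PipelineAck(int numOfReplies) {
    this.numOfReplies = numOfReplies;
  }

  public long getSeqno() {
    return seqno;
  }

  public void readFields(DataInput in) throws IOException {
    seqno = in.readLong();
    if (seqno == HEART_BEAT_SEQNO) {
      replies = new short[0];       // heartbeats carry no status replies
      return;
    }
    // Possibly worth rejecting any other negative seqno here (the -2 question above).
    if (numOfReplies < 0) {
      throw new IOException("readFields called without a downstream pipeline depth");
    }
    replies = new short[numOfReplies];   // exactly one reply per downstream datanode
    for (int i = 0; i < numOfReplies; i++) {
      replies[i] = in.readShort();
    }
  }

  public void write(DataOutput out) throws IOException {
    out.writeLong(seqno);           // note: no reply-count field on the wire
    for (short reply : replies) {
      out.writeShort(reply);
    }
  }
}
{code}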
In DFSClient I added a DEBUG-level printout of the contents of the pipeline. This
was useful while testing, to ensure that I killed each of the nodes in the pipeline
in the intended order.
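For illustration, the printout is roughly of this form (the variable names here are
just stand-ins for DFSClient's pipeline state, not the exact patch):
{code}
// Illustrative only: roughly what the pipeline debug printout looks like.
if (LOG.isDebugEnabled()) {
  for (int i = 0; i < nodes.length; i++) {
    LOG.debug("pipeline node " + i + " = " + nodes[i].getName());
  }
}
{code}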
In BlockReceiver, I added back the "continue" during HEARTBEAT processing. I
believe this was an omission in the earlier patch: best I can tell, without the
continue, the responder currently sends a spurious "seqno=-2" ack after each
heartbeat. With the continue, it circles around the loop correctly to wait for
the next ack.
*For review*: I put a TODO for the case where BlockReceiver receives a seqno =
-2. I currently believe that any negative sequence number that is not the
heartbeat seqno is an error and should throw an IOException (e.g. we got our
reads misaligned).
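Roughly, the heartbeat handling I'm describing looks like the sketch below, assuming
a PipelineAck like the one sketched earlier; mirrorIn, replyOut, numTargets, and
processAck() are stand-ins for PacketResponder's fields and normal ack handling,
not code from the patch:
{code}
// Minimal sketch of the ack-relay loop on a non-last datanode (illustrative only).
void relayAcks(DataInputStream mirrorIn, DataOutputStream replyOut, int numTargets)
    throws IOException {
  while (true) {
    PipelineAck ack = new PipelineAck(numTargets);  // expect one reply per downstream node
    ack.readFields(mirrorIn);                       // read the ack from downstream

    if (ack.getSeqno() == PipelineAck.HEART_BEAT_SEQNO) {
      ack.write(replyOut);                          // relay the heartbeat upstream unchanged
      replyOut.flush();
      continue;                                     // wait for the next ack instead of falling
                                                    // through and emitting a spurious seqno=-2 ack
    }
    // Any other negative seqno likely means our reads are misaligned; arguably this
    // should throw an IOException rather than be processed.
    processAck(ack, replyOut);                      // placeholder for the normal ack handling
  }
}
{code}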
When constructing the ack message for a failed mirror, since every ack must
have the same number of replies, I send SUCCESS followed by N ERROR statuses, where
N is the number of downstream targets. The client's behavior is to eject the
first ERROR node, so the presence of ERROR statuses further downstream is
unimportant; in truth they are semantically UNKNOWN, but no such status code
exists. *For review*: HDFS-793 reversed the order of the loop at
DFSClient.java:2431 to locate the _last_ DN with ERROR status. I had to reverse
this back to the original loop order for this patch, since the replies look like
SUCCESS, ERROR, ERROR in the case that DN 2 dies.
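As a sketch of how the two ends fit together (the status constant values are taken
from DataTransferProtocol but assumed here; the class and method names are mine,
purely for illustration):
{code}
// Illustrative sketch, not the literal patch.
public class AckSketch {
  // Status codes as in DataTransferProtocol (values assumed here).
  static final short OP_STATUS_SUCCESS = 0;
  static final short OP_STATUS_ERROR = 1;

  // Datanode side: even with a dead mirror, the ack still carries one status
  // per downstream target, so the reply count stays fixed.
  static short[] repliesForFailedMirror(int numTargets) {
    short[] replies = new short[1 + numTargets];
    replies[0] = OP_STATUS_SUCCESS;        // this datanode wrote the packet fine
    for (int i = 1; i < replies.length; i++) {
      replies[i] = OP_STATUS_ERROR;        // really "unknown", but no such status code exists
    }
    return replies;
  }

  // Client side: scan forward so the *first* ERROR wins, since the replies look
  // like SUCCESS, ERROR, ERROR when the second datanode dies.
  static int firstFailedDatanode(short[] replies) {
    for (int i = 0; i < replies.length; i++) {
      if (replies[i] != OP_STATUS_SUCCESS) {
        return i;                          // becomes DFSClient's errorIndex
      }
    }
    return -1;                             // every datanode acked SUCCESS
  }
}
{code}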
In terms of testing, I performed the following:
- Started up a 3-node distributed cluster using patched servers
- With an unpatched client, began uploading a file. Killed each node in the
pipeline (first, second, last) and ensured that the correct datanode was
ejected.
- With a patched client and patched server, ran the same test.
- With patched client and unpatched server, ensured that file uploads work
properly. I did not test killing the unpatched server nodes here - I can do so
if necessary, but was using a shared cluster for this test.
In all cases, the file upload lasted more than 30 seconds, so heartbeats were
tested.
> DFSClient 0.20.1 is incompatible with HDFS 0.20.2
> -------------------------------------------------
>
> Key: HDFS-872
> URL: https://issues.apache.org/jira/browse/HDFS-872
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 0.20.1, 0.20.2
> Reporter: Bassam Tabbara
> Fix For: 0.20.2
>
> Attachments: hdfs-872.txt
>
>
> After upgrading to the latest HDFS 0.20.2 (r896310 from /branches/branch-0.20),
> old DFS clients (0.20.1) seem to not work anymore. HBase uses the 0.20.1 hadoop
> core jars and the HBase master will no longer start up. Here is the exception
> from the HBase master log:
> {code}
> 2010-01-06 09:59:46,762 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read:
> java.io.IOException: Could not obtain block: blk_3380512596555557728_1002 file=/hbase/hbase.version
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1788)
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1616)
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1673)
>     at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:320)
>     at java.io.DataInputStream.readUTF(DataInputStream.java:572)
>     at org.apache.hadoop.hbase.util.FSUtils.getVersion(FSUtils.java:189)
>     at org.apache.hadoop.hbase.util.FSUtils.checkVersion(FSUtils.java:208)
>     at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:208)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>     at org.apache.hadoop.hbase.master.HMaster.doMain(HMaster.java:1241)
>     at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1282)
> 2010-01-06 09:59:46,763 FATAL org.apache.hadoop.hbase.master.HMaster: Not starting HMaster because:
> java.io.IOException: Could not obtain block: blk_3380512596555557728_1002 file=/hbase/hbase.version
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1788)
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1616)
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1673)
>     at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:320)
>     at java.io.DataInputStream.readUTF(DataInputStream.java:572)
>     at org.apache.hadoop.hbase.util.FSUtils.getVersion(FSUtils.java:189)
>     at org.apache.hadoop.hbase.util.FSUtils.checkVersion(FSUtils.java:208)
>     at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:208)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>     at org.apache.hadoop.hbase.master.HMaster.doMain(HMaster.java:1241)
>     at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1282)
> {code}
> If I switch the hadoop jars in the hbase/lib directory with the 0.20.2 version
> it works well, which is what led me to open this bug here and not in the HBASE
> project.