checksumOk implementation in DFSClient can break applications
-------------------------------------------------------------

                 Key: HADOOP-3914
                 URL: https://issues.apache.org/jira/browse/HADOOP-3914
             Project: Hadoop Core
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.17.1
            Reporter: Christian Kunz


One of our non-map-reduce applications (written in C and using libhdfs to 
access dfs) stopped working after the switch from 0.16 to 0.17.
The problem was finally traced to failures in checksumOk.

I assume the purpose of checksumOk is for a DFSClient to indicate to the 
sending Datanode that the checksum of the received block is okay. This must be 
useful in the replication pipeline.
checksumOk is implemented so that any IOException is caught and ignored, 
probably because it is not essential for the client that the message is 
sent successfully.

This proved fatal for our application, however, because it links in a 
3rd-party library that seems to catch socket exceptions before libhdfs does.

Why was there an exception? In our case the local buffer of the 
DFSInputStream is large enough to hold all the data of the file. The 
application reads to the end, and checksumOk is sent successfully at that 
time. But then the application does some other work and comes back to re-read 
the file (still open). At that point checksumOk is called again and the 
application crashes.

This can easily be avoided by adding a boolean flag that ensures checksumOk is 
called only once, when end of stream (EOS) is first encountered. Redundant 
calls to checksumOk do not seem to make sense anyhow.
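
A minimal sketch of the proposed guard, for illustration only. This is not the 
actual DFSClient code: ChecksumOkGuardSketch, StatusSender, GuardedBlockReader 
and onEndOfStream are hypothetical names standing in for the real reader and 
its connection to the Datanode. It assumes the acknowledgement stays best 
effort (the IOException is still swallowed) and only adds the flag.

```java
import java.io.IOException;

public class ChecksumOkGuardSketch {

    /** Hypothetical stand-in for the socket write back to the Datanode. */
    interface StatusSender {
        void sendChecksumOk() throws IOException;
    }

    /** Hypothetical reader that sends the checksum acknowledgement at most once. */
    static class GuardedBlockReader {
        private final StatusSender sender;
        private boolean sentChecksumOk = false;

        GuardedBlockReader(StatusSender sender) {
            this.sender = sender;
        }

        /** Called each time the reader reaches end of stream (EOS). */
        void onEndOfStream() {
            if (sentChecksumOk) {
                return; // already attempted; re-reads do not touch the socket
            }
            // Set the flag before sending so that a failed send is not retried.
            sentChecksumOk = true;
            try {
                sender.sendChecksumOk();
            } catch (IOException e) {
                // Best effort: the client does not depend on this message.
            }
        }

        boolean hasSentChecksumOk() {
            return sentChecksumOk;
        }
    }
}
```

With this guard, re-reading a still-open file that is already fully buffered 
reaches EOS again without attempting a second write on the socket.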

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
