[ https://issues.apache.org/jira/browse/HADOOP-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641185#action_12641185 ]

Hairong Kuang commented on HADOOP-3914:
---------------------------------------

Hi Christian, when checksumOk is called a second time, could we log it together with the stack trace? That way, we can investigate the cause of the problem if it happens again.

> checksumOk implementation in DFSClient can break applications
> -------------------------------------------------------------
>
>                 Key: HADOOP-3914
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3914
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.17.1
>            Reporter: Christian Kunz
>            Assignee: Christian Kunz
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: patch.HADOOP-3914
>
>
> One of our non-map-reduce applications (written in C and using libhdfs to
> access dfs) stopped working after the switch from 0.16 to 0.17.
> The problem was finally traced down to failures in checksumOk.
> I would assume the purpose of checksumOk is for a DFSClient to indicate to a
> sending Datanode that the checksum of the received block is okay. This must
> be useful in the replication pipeline.
> The way checksumOk is implemented, any IOException is caught and ignored,
> probably because it is not essential for the client that the message is
> successful.
> But it proved fatal for our application because this application links in a
> 3rd-party library which seems to catch socket exceptions before libhdfs.
> Why was there an Exception? In our case the application reads a file into the
> local buffer of the DFSInputStream, which is large enough to hold all data; the
> application reads to the end and the checksumOk is sent successfully at that
> time. But then the application does some other stuff and comes back to
> re-read the file (still open). That is when it calls checksumOk again and
> crashes.
> It can easily be avoided by adding a Boolean making sure that checksumOk is
> called exactly once when EOS is encountered. Redundant calls to checksumOk do
> not seem to make sense anyhow.
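
For illustration only, here is a minimal sketch of the two ideas discussed above: a Boolean guard so the acknowledgement is sent at most once, plus a log entry with a stack trace whenever a redundant call is attempted. It is not the attached patch and not the actual DFSClient code. The class ChecksumOkGuard and the interface AckSender are hypothetical names introduced here, and AckSender merely stands in for whatever writes the acknowledgement to the Datanode socket.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper (not part of Hadoop): sends the checksum-OK ack at most once. */
class ChecksumOkGuard {
    private static final Logger LOG =
        Logger.getLogger(ChecksumOkGuard.class.getName());

    /** Hypothetical stand-in for the code that writes the ack to the Datanode. */
    interface AckSender {
        void sendChecksumOk() throws IOException;
    }

    // The "Boolean" from the issue description; atomic so concurrent readers are safe.
    private final AtomicBoolean sent = new AtomicBoolean(false);

    /**
     * Attempts to send the ack the first time it is called.
     * Returns true if this call attempted the send, false if it was skipped
     * because an earlier call already did.
     */
    boolean checksumOk(AckSender sender) {
        if (!sent.compareAndSet(false, true)) {
            // Redundant call: log it with a stack trace so the caller can be found.
            LOG.log(Level.WARNING, "checksumOk called again; skipping",
                    new Throwable("redundant checksumOk call"));
            return false;
        }
        try {
            sender.sendChecksumOk();
        } catch (IOException e) {
            // As in the behaviour described in the issue, a failed ack is not
            // fatal for the client, so it is only logged.
            LOG.log(Level.FINE, "Could not send checksumOk", e);
        }
        return true;
    }
}
```

With a guard like this, re-reading an already buffered file would no longer touch the socket a second time; the second call would only produce a warning with the caller's stack trace.
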
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.