[ https://issues.apache.org/jira/browse/HADOOP-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645091#action_12645091 ]
Jason commented on HADOOP-3914:
-------------------------------

We applied this patch on a machine that was failing to read a directory of files (reads of the individual files were fine):

hadoop dfs -text path_to_directory/'*'

08/11/04 14:08:32 [main] INFO fs.FSInputChecker: java.io.IOException: Checksum ok was sent and should not be sent again
        at org.apache.hadoop.dfs.DFSClient$BlockReader.read(DFSClient.java:863)
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1392)
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1428)
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1377)
        at java.io.DataInputStream.readInt(DataInputStream.java:370)
        at org.apache.hadoop.io.SequenceFile$Metadata.readFields(SequenceFile.java:725)
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1511)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1431)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1420)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1415)
        at org.apache.hadoop.fs.FsShell$TextRecordInputStream.<init>(FsShell.java:365)
        at org.apache.hadoop.fs.FsShell.forMagic(FsShell.java:403)
        at org.apache.hadoop.fs.FsShell.access$200(FsShell.java:49)
        at org.apache.hadoop.fs.FsShell$2.process(FsShell.java:419)
        at org.apache.hadoop.fs.FsShell$DelayedExceptionThrowing.globAndProcess(FsShell.java:1865)
        at org.apache.hadoop.fs.FsShell.text(FsShell.java:413)
        at org.apache.hadoop.fs.FsShell.doall(FsShell.java:1532)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:1730)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:1847)

> checksumOk implementation in DFSClient can break applications
> -------------------------------------------------------------
>
>                 Key: HADOOP-3914
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3914
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.17.1
>            Reporter: Christian Kunz
>            Assignee: Christian Kunz
>            Priority: Blocker
>             Fix For: 0.18.2
>
>         Attachments: checksumOk.patch, checksumOk1-br18.patch, checksumOk1.patch, patch.HADOOP-3914
>
>
> One of our non-map-reduce applications (written in C and using libhdfs to access dfs) stopped working after switching from 0.16 to 0.17. The problem was finally traced down to failures in checksumOk.
> I would assume the purpose of checksumOk is for a DFSClient to indicate to a sending Datanode that the checksum of the received block is okay. This must be useful in the replication pipeline.
> checksumOk is implemented such that any IOException is caught and ignored, probably because it is not essential for the client that the message arrive successfully. But this proved fatal for our application, because it links in a third-party library which seems to catch socket exceptions before libhdfs does.
> Why was there an exception? In our case the application reads a file into the local buffer of the DFSInputStream, which is large enough to hold all the data; the application reads to the end, and checksumOk is sent successfully at that point. But then the application does some other work and comes back to re-read the file (still open). It is then that it calls checksumOk again and crashes.
> This can easily be avoided by adding a boolean to make sure that checksumOk is called exactly once when EOS is encountered (a sketch follows below).
> Redundant calls to checksumOk do not seem to make sense anyhow.
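For illustration, here is a minimal sketch of the one-shot guard the description proposes. It is not the actual HADOOP-3914 patch; the class, field, and status-code names (BlockReaderSketch, sentChecksumOk, CHECKSUM_OK) are hypothetical stand-ins for the corresponding pieces of DFSClient$BlockReader.

    import java.io.DataOutputStream;
    import java.io.IOException;

    /**
     * Illustrative sketch (not the real patch): guard the checksum-OK
     * message with a boolean so it is sent at most once per block read,
     * no matter how many times the reader hits end-of-stream.
     */
    class BlockReaderSketch {
        private static final int CHECKSUM_OK = 5; // placeholder status code
        private boolean sentChecksumOk = false;   // the boolean the reporter proposes

        /** Called when the reader reaches end-of-stream for the block. */
        void checksumOk(DataOutputStream out) {
            if (sentChecksumOk) {
                return; // already acknowledged; do not send again
            }
            try {
                out.writeInt(CHECKSUM_OK);
                out.flush();
            } catch (IOException e) {
                // Successful delivery is not essential for the client,
                // so the exception is deliberately swallowed, as the
                // description says DFSClient does.
            }
            sentChecksumOk = true; // set even on failure; never retried
        }
    }

With a guard like this, an application that re-reads a still-open file and drives the reader to end-of-stream a second time hits a harmless no-op instead of the "Checksum ok was sent and should not be sent again" IOException shown in the stack trace above.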