[jira] [Comment Edited] (HDDS-2376) Fail to read data through XceiverClientGrpc
[ https://issues.apache.org/jira/browse/HDDS-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964661#comment-16964661 ] Sammi Chen edited comment on HDDS-2376 at 11/1/19 6:56 AM:
---
The root cause is that I didn't restart Hadoop 2.7.5 after deploying the latest Ozone binary, so Hadoop was still using an old Ozone client (about two months old). The OzoneChecksumException is thrown on the NodeManager; logs attached. It seems something changed on the Ozone server side that makes an old Ozone client unable to verify data it wrote itself. [~msingh] and [~hanishakoneru], thanks for paying attention to this issue. I will close it now.

2019-11-01 11:46:02,230 ERROR org.apache.hadoop.hdds.scm.XceiverClientGrpc: Failed to execute command cmdType: ReadChunk traceID: "" containerID: 1145 datanodeUuid: "ed90869c-317e-4303-8922-9fa83a3983cb" readChunk { blockID { containerID: 1145 localID: 103060600027086850 blockCommitSequenceId: 948 } chunkData { chunkName: "103060600027086850_chunk_1" offset: 0 len: 245 checksumData { type: CRC32 bytesPerChecksum: 1048576 checksums: "\247\304Yf" } } } on datanode 1da74a1d-f64d-4ad4-b04c-85f26687e683
org.apache.hadoop.ozone.common.OzoneChecksumException: Checksum mismatch at index 0
	at org.apache.hadoop.ozone.common.ChecksumData.verifyChecksumDataMatches(ChecksumData.java:148)
	at org.apache.hadoop.ozone.common.Checksum.verifyChecksum(Checksum.java:275)
	at org.apache.hadoop.ozone.common.Checksum.verifyChecksum(Checksum.java:238)
	at org.apache.hadoop.hdds.scm.storage.ChunkInputStream.lambda$new$0(ChunkInputStream.java:375)
	at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithRetry(XceiverClientGrpc.java:287)
	at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithTraceIDAndRetry(XceiverClientGrpc.java:250)
	at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommand(XceiverClientGrpc.java:233)
	at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.readChunk(ContainerProtocolCalls.java:245)
	at org.apache.hadoop.hdds.scm.storage.ChunkInputStream.readChunk(ChunkInputStream.java:335)
	at org.apache.hadoop.hdds.scm.storage.ChunkInputStream.readChunkFromContainer(ChunkInputStream.java:307)
	at org.apache.hadoop.hdds.scm.storage.ChunkInputStream.prepareRead(ChunkInputStream.java:259)
	at org.apache.hadoop.hdds.scm.storage.ChunkInputStream.read(ChunkInputStream.java:144)
	at org.apache.hadoop.hdds.scm.storage.BlockInputStream.read(BlockInputStream.java:239)
	at org.apache.hadoop.ozone.client.io.KeyInputStream.read(KeyInputStream.java:171)
	at org.apache.hadoop.fs.ozone.OzoneFSInputStream.read(OzoneFSInputStream.java:52)
	at java.io.DataInputStream.read(DataInputStream.java:100)
	at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:86)
	at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:60)
	at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:120)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:366)
	at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:267)
	at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
	at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
	at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)
	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

2019-11-01 11:46:02,243 ERROR org.apache.hadoop.hdds.scm.XceiverClientGrpc: Failed to execute command cmdType: ReadChunk traceID: "" containerID: 1145 datanodeUuid: "ed90869c-317e-4303-8922-9fa83a3983cb" readChunk { blockID { containerID: 1145 localID: 103060600027086850 blockCommitSequenceId: 948 } chunkData { chunkName: "103060600027086850_chunk_1" offset: 0 len: 245 checksumData { type: CRC32 bytesPerChecksum: 1048576 checksums: "\247\304Yf" } } } on datanode ed90869c-317e-4303-8922-9fa83a3983cb
org.apache.hadoop.ozone.common.OzoneChecksumException:
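For context on the error above: the log shows `type: CRC32 bytesPerChecksum: 1048576`, i.e. the writer stores one CRC32 value per 1 MiB block of chunk data, and the reader recomputes and compares them, reporting the index of the first block that differs. The sketch below illustrates that scheme; it is a hypothetical simplification, not Ozone's actual `Checksum`/`ChecksumData` implementation, and the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical sketch of per-block CRC32 checksumming: the writer splits the
// chunk into bytesPerChecksum-sized blocks and stores one CRC32 per block; the
// reader recomputes each block's CRC32 and compares against the stored list.
public class ChecksumSketch {

  // Compute one CRC32 value per bytesPerChecksum-sized block of data.
  static List<Long> computeChecksums(byte[] data, int bytesPerChecksum) {
    List<Long> checksums = new ArrayList<>();
    for (int off = 0; off < data.length; off += bytesPerChecksum) {
      int len = Math.min(bytesPerChecksum, data.length - off);
      CRC32 crc = new CRC32();
      crc.update(data, off, len);
      checksums.add(crc.getValue());
    }
    return checksums;
  }

  // Return the index of the first mismatching block, or -1 if all match --
  // analogous to the "Checksum mismatch at index N" in the exception above.
  static int firstMismatch(List<Long> stored, List<Long> recomputed) {
    for (int i = 0; i < Math.min(stored.size(), recomputed.size()); i++) {
      if (!stored.get(i).equals(recomputed.get(i))) {
        return i;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    byte[] written = "chunk data as written".getBytes(StandardCharsets.UTF_8);
    byte[] changed = "chunk data as changed".getBytes(StandardCharsets.UTF_8);

    // 1 MiB blocks, as in the log; this small chunk fits in a single block.
    List<Long> stored = computeChecksums(written, 1048576);
    System.out.println(firstMismatch(stored, computeChecksums(written, 1048576))); // -1
    System.out.println(firstMismatch(stored, computeChecksums(changed, 1048576))); // 0
  }
}
```

With this scheme, "mismatch at index 0" means the very first block already fails verification, which is consistent with a systematic difference in how the two sides compute or compare checksums (e.g. a version-skewed client) rather than a single corrupted byte late in the chunk.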
[jira] [Comment Edited] (HDDS-2376) Fail to read data through XceiverClientGrpc
[ https://issues.apache.org/jira/browse/HDDS-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16963835#comment-16963835 ] Istvan Fajth edited comment on HDDS-2376 at 10/31/19 10:08 AM:
---
Hi [~Sammi], I have run into a similar exception in one of our test environments while preparing some testing. It appeared after I updated the Ozone and Ratis jars on the cluster, and I couldn't get to the bottom of it, as there were some other minor changes as well. After rewriting the data everything started to work properly, so I haven't been able to reproduce it since. Could the same have happened on your side? Was there an update to Ozone after which you started to see this?
> Fail to read data through XceiverClientGrpc
> ---
>
> Key: HDDS-2376
> URL: https://issues.apache.org/jira/browse/HDDS-2376
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Reporter: Sammi Chen
> Assignee: Hanisha Koneru
> Priority: Blocker
>
> Ran teragen; the application failed with the following stack:
> 19/10/29 14:35:42 INFO mapreduce.Job: Running job: job_1567133159094_0048
> 19/10/29 14:35:59 INFO mapreduce.Job: Job job_1567133159094_0048 running in uber mode : false
> 19/10/29 14:35:59 INFO mapreduce.Job: map 0% reduce 0%
> 19/10/29 14:35:59 INFO mapreduce.Job: Job job_1567133159094_0048 failed with state FAILED due to: Application application_1567133159094_0048 failed 2 times due to AM Container for appattempt_1567133159094_0048_02 exited with exitCode: -1000
> For more detailed output, check application tracking page: http://host183:8088/cluster/app/application_1567133159094_0048 Then, click on links to logs of each attempt.
> Diagnostics: Unexpected OzoneException: org.apache.hadoop.ozone.common.OzoneChecksumException: Checksum mismatch at index 0
> java.io.IOException: Unexpected OzoneException: org.apache.hadoop.ozone.common.OzoneChecksumException: Checksum mismatch at index 0
> at org.apache.hadoop.hdds.scm.storage.ChunkInputStream.readChunk(ChunkInputStream.java:342)
> at org.apache.hadoop.hdds.scm.storage.ChunkInputStream.readChunkFromContainer(ChunkInputStream.java:307)
> at org.apache.hadoop.hdds.scm.storage.ChunkInputStream.prepareRead(ChunkInputStream.java:259)
> at org.apache.hadoop.hdds.scm.storage.ChunkInputStream.read(ChunkInputStream.java:144)
> at org.apache.hadoop.hdds.scm.storage.BlockInputStream.read(BlockInputStream.java:239)
> at org.apache.hadoop.ozone.client.io.KeyInputStream.read(KeyInputStream.java:171)
> at org.apache.hadoop.fs.ozone.OzoneFSInputStream.read(OzoneFSInputStream.java:52)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:86)
> at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:60)
> at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:120)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:366)
> at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:267)
> at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
> at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
> at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.ozone.common.OzoneChecksumException: Checksum mismatch at index 0
> at >