[jira] [Comment Edited] (HDDS-2376) Fail to read data through XceiverClientGrpc

2019-11-01 Thread Sammi Chen (Jira)


[ https://issues.apache.org/jira/browse/HDDS-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964661#comment-16964661 ]

Sammi Chen edited comment on HDDS-2376 at 11/1/19 6:56 AM:
---

The root cause is that I didn't restart Hadoop 2.7.5 after I deployed the 
latest Ozone binary, so Hadoop was still using an old version of the Ozone 
client (from about two months earlier). This OzoneChecksumException is thrown 
by the NodeManager; logs attached. It seems something has changed on the 
Ozone server side that makes an old Ozone client unable to verify data it 
wrote itself.
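
For reference, this is roughly what the read path's verification does (a 
minimal sketch only; the class and method names below are illustrative 
stand-ins, not the real org.apache.hadoop.ozone.common.Checksum API): the 
chunk is split into bytesPerChecksum-sized windows, one CRC32 value is 
computed per window, and the recomputed values are compared against the 
values stored in checksumData. The "index" in the exception is the first 
window that differs; with len: 245 and bytesPerChecksum: 1048576 as in the 
logs below, there is exactly one window, so any discrepancy reports index 0.

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Illustrative sketch of per-window CRC32 chunk verification; the real
// logic lives in org.apache.hadoop.ozone.common.Checksum/ChecksumData.
public class ChunkChecksumSketch {

  // One CRC32 value per bytesPerChecksum-sized window of the chunk.
  static List<Long> computeCrc32(byte[] chunk, int bytesPerChecksum) {
    List<Long> checksums = new ArrayList<>();
    for (int off = 0; off < chunk.length; off += bytesPerChecksum) {
      CRC32 crc = new CRC32();
      crc.update(chunk, off, Math.min(bytesPerChecksum, chunk.length - off));
      checksums.add(crc.getValue());
    }
    return checksums;
  }

  // Recompute and compare; the reported index is the first mismatching window.
  static void verify(byte[] chunk, List<Long> stored, int bytesPerChecksum)
      throws IOException {
    List<Long> actual = computeCrc32(chunk, bytesPerChecksum);
    for (int i = 0; i < stored.size(); i++) {
      if (!stored.get(i).equals(actual.get(i))) {
        throw new IOException("Checksum mismatch at index " + i);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    // Same shape as the failing read: a 245-byte chunk with
    // bytesPerChecksum = 1048576 yields a single window, index 0.
    byte[] data = new byte[245];
    List<Long> stored = computeCrc32(data, 1048576);
    verify(data, stored, 1048576); // passes
    data[0] ^= 1;                  // flip one bit in the data...
    verify(data, stored, 1048576); // ...throws "Checksum mismatch at index 0"
  }
}
{code}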

[~msingh] and [~hanishakoneru], thanks for paying attention to this issue. I 
will close it now.
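
In case it helps others debugging the same symptom: a quick way to confirm 
which Ozone client jar a JVM actually loaded is to print the code source of a 
client class (plain Java; XceiverClientGrpc is just a convenient class taken 
from the trace below, any Ozone client class would do):

{code:java}
// Prints the jar that the running JVM loaded XceiverClientGrpc from,
// which makes a stale client deployment like the one above easy to spot.
public class WhichOzoneClientJar {
  public static void main(String[] args) {
    System.out.println(org.apache.hadoop.hdds.scm.XceiverClientGrpc.class
        .getProtectionDomain().getCodeSource().getLocation());
  }
}
{code}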

2019-11-01 11:46:02,230 ERROR org.apache.hadoop.hdds.scm.XceiverClientGrpc: Failed to execute command cmdType: ReadChunk
traceID: ""
containerID: 1145
datanodeUuid: "ed90869c-317e-4303-8922-9fa83a3983cb"
readChunk {
  blockID {
    containerID: 1145
    localID: 103060600027086850
    blockCommitSequenceId: 948
  }
  chunkData {
    chunkName: "103060600027086850_chunk_1"
    offset: 0
    len: 245
    checksumData {
      type: CRC32
      bytesPerChecksum: 1048576
      checksums: "\247\304Yf"
    }
  }
}
 on datanode 1da74a1d-f64d-4ad4-b04c-85f26687e683
org.apache.hadoop.ozone.common.OzoneChecksumException: Checksum mismatch at index 0
    at org.apache.hadoop.ozone.common.ChecksumData.verifyChecksumDataMatches(ChecksumData.java:148)
    at org.apache.hadoop.ozone.common.Checksum.verifyChecksum(Checksum.java:275)
    at org.apache.hadoop.ozone.common.Checksum.verifyChecksum(Checksum.java:238)
    at org.apache.hadoop.hdds.scm.storage.ChunkInputStream.lambda$new$0(ChunkInputStream.java:375)
    at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithRetry(XceiverClientGrpc.java:287)
    at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithTraceIDAndRetry(XceiverClientGrpc.java:250)
    at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommand(XceiverClientGrpc.java:233)
    at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.readChunk(ContainerProtocolCalls.java:245)
    at org.apache.hadoop.hdds.scm.storage.ChunkInputStream.readChunk(ChunkInputStream.java:335)
    at org.apache.hadoop.hdds.scm.storage.ChunkInputStream.readChunkFromContainer(ChunkInputStream.java:307)
    at org.apache.hadoop.hdds.scm.storage.ChunkInputStream.prepareRead(ChunkInputStream.java:259)
    at org.apache.hadoop.hdds.scm.storage.ChunkInputStream.read(ChunkInputStream.java:144)
    at org.apache.hadoop.hdds.scm.storage.BlockInputStream.read(BlockInputStream.java:239)
    at org.apache.hadoop.ozone.client.io.KeyInputStream.read(KeyInputStream.java:171)
    at org.apache.hadoop.fs.ozone.OzoneFSInputStream.read(OzoneFSInputStream.java:52)
    at java.io.DataInputStream.read(DataInputStream.java:100)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:86)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:60)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:120)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:366)
    at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:267)
    at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2019-11-01 11:46:02,243 ERROR org.apache.hadoop.hdds.scm.XceiverClientGrpc: Failed to execute command cmdType: ReadChunk
traceID: ""
containerID: 1145
datanodeUuid: "ed90869c-317e-4303-8922-9fa83a3983cb"
readChunk {
  blockID {
    containerID: 1145
    localID: 103060600027086850
    blockCommitSequenceId: 948
  }
  chunkData {
    chunkName: "103060600027086850_chunk_1"
    offset: 0
    len: 245
    checksumData {
      type: CRC32
      bytesPerChecksum: 1048576
      checksums: "\247\304Yf"
    }
  }
}
 on datanode ed90869c-317e-4303-8922-9fa83a3983cb
org.apache.hadoop.ozone.common.OzoneChecksumException: 

[jira] [Comment Edited] (HDDS-2376) Fail to read data through XceiverClientGrpc

2019-10-31 Thread Istvan Fajth (Jira)


[ https://issues.apache.org/jira/browse/HDDS-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963835#comment-16963835 ]

Istvan Fajth edited comment on HDDS-2376 at 10/31/19 10:08 AM:
---

Hi [~Sammi],

I ran into a similar exception in one of our test environments while 
preparing some tests. It appeared after I updated the Ozone and Ratis jars on 
the cluster, and I couldn't get to the bottom of it because there were some 
other minor changes at the same time; after rewriting the data everything 
started to work properly, and I haven't managed to reproduce it since.
Could the same have happened on your side? Was there an update to Ozone after 
which you started to see this?



> Fail to read data through XceiverClientGrpc
> ---
>
> Key: HDDS-2376
> URL: https://issues.apache.org/jira/browse/HDDS-2376
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Sammi Chen
>Assignee: Hanisha Koneru
>Priority: Blocker
>
> Ran teragen; the application failed with the following stack trace:
> 19/10/29 14:35:42 INFO mapreduce.Job: Running job: job_1567133159094_0048
> 19/10/29 14:35:59 INFO mapreduce.Job: Job job_1567133159094_0048 running in uber mode : false
> 19/10/29 14:35:59 INFO mapreduce.Job:  map 0% reduce 0%
> 19/10/29 14:35:59 INFO mapreduce.Job: Job job_1567133159094_0048 failed with state FAILED due to: Application application_1567133159094_0048 failed 2 times due to AM Container for appattempt_1567133159094_0048_02 exited with  exitCode: -1000
> For more detailed output, check application tracking page:http://host183:8088/cluster/app/application_1567133159094_0048Then, click on links to logs of each attempt.
> Diagnostics: Unexpected OzoneException: org.apache.hadoop.ozone.common.OzoneChecksumException: Checksum mismatch at index 0
> java.io.IOException: Unexpected OzoneException: org.apache.hadoop.ozone.common.OzoneChecksumException: Checksum mismatch at index 0
>   at org.apache.hadoop.hdds.scm.storage.ChunkInputStream.readChunk(ChunkInputStream.java:342)
>   at org.apache.hadoop.hdds.scm.storage.ChunkInputStream.readChunkFromContainer(ChunkInputStream.java:307)
>   at org.apache.hadoop.hdds.scm.storage.ChunkInputStream.prepareRead(ChunkInputStream.java:259)
>   at org.apache.hadoop.hdds.scm.storage.ChunkInputStream.read(ChunkInputStream.java:144)
>   at org.apache.hadoop.hdds.scm.storage.BlockInputStream.read(BlockInputStream.java:239)
>   at org.apache.hadoop.ozone.client.io.KeyInputStream.read(KeyInputStream.java:171)
>   at org.apache.hadoop.fs.ozone.OzoneFSInputStream.read(OzoneFSInputStream.java:52)
>   at java.io.DataInputStream.read(DataInputStream.java:100)
>   at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:86)
>   at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:60)
>   at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:120)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:366)
>   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:267)
>   at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.ozone.common.OzoneChecksumException: Checksum mismatch at index 0
>   at 
>