[
https://issues.apache.org/jira/browse/HDFS-14027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16665889#comment-16665889
]
Xiao Chen edited comment on HDFS-14027 at 10/29/18 2:14 AM:
------------------------------------------------------------
Thanks for the additional review, Daniel. Also thanks, Imran, for the comment.
Made the log message debug level in patch 3, and slightly modified the text
({{DFSStripedOutputStream}} should be explicit enough about the 'EC' part, IMO).
Feel free to share the desired wording and I'd be happy to update. (Also fixed
the deprecation warning to make precommit happy.)
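For reviewers skimming the patch: the change is roughly the shape of the sketch below. This is a minimal sketch only, not the patch itself; the exact log text and method set are in HDFS-14027.03.patch.
{code:java}
// Sketch (assumed shape; see the attached patch for the real code):
// DFSStripedOutputStream does not support hflush/hsync, so every
// overload should behave the same way -- log at debug level and
// return -- instead of one overload silently falling through to the
// replicated-file implementation.
@Override
public void hflush() {
  LOG.debug("DFSStripedOutputStream does not support hflush. "
      + "Caller should check StreamCapabilities before calling.");
}

@Override
public void hsync() {
  LOG.debug("DFSStripedOutputStream does not support hsync. "
      + "Caller should check StreamCapabilities before calling.");
}

@Override
public void hsync(EnumSet<SyncFlag> syncFlags) {
  // This is the overload that was missed: without it, hsync(flags)
  // used the replicated-file code path on an EC stream.
  LOG.debug("DFSStripedOutputStream does not support hsync {}. "
      + "Caller should check StreamCapabilities before calling.", syncFlags);
}
{code}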
was (Author: xiaochen):
Thanks for the additional review, Daniel. Also thanks, Imran, for the comment.
Made the log message debug level in patch 3, and slightly modified the text
({{DFSStripedOutputStream}} should be explicit enough about the 'EC' part, IMO).
Feel free to share the desired wording and I'd be happy to update.
> DFSStripedOutputStream should implement both hsync methods
> ----------------------------------------------------------
>
> Key: HDFS-14027
> URL: https://issues.apache.org/jira/browse/HDFS-14027
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: erasure-coding
> Affects Versions: 3.0.0
> Reporter: Xiao Chen
> Assignee: Xiao Chen
> Priority: Critical
> Attachments: HDFS-14027.01.patch, HDFS-14027.02.patch,
> HDFS-14027.03.patch
>
>
> In an internal Spark investigation, it appears that when
> [EventLoggingListener|https://github.com/apache/spark/blob/7251be0c04f0380208e0197e559158a9e1400868/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L152-L155]
> writes to an EC file, reads of that file may throw exceptions or return odd
> output. A sample exception is:
> {noformat}
> hdfs dfs -cat /user/spark/applicationHistory/application_1540333573846_0003 | head -1
> 18/10/23 18:12:39 WARN impl.BlockReaderFactory: I/O error constructing remote block reader.
> java.io.IOException: Got error, status=ERROR, status message opReadBlock BP-1488936467-HOST_IP-1540333392519:blk_-9223372036854774960_1085 received exception java.io.IOException: Offset 0 and length 116161 don't match block BP-1488936467-HOST_IP-1540333392519:blk_-9223372036854774960_1085 ( blockLen 110296 ), for OP_READ_BLOCK, self=/HOST_IP:48610, remote=/HOST2_IP:20002, for file /user/spark/applicationHistory/application_1540333573846_0003, for pool BP-1488936467-HOST_IP-1540333392519 block -9223372036854774960_1085
>     at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:134)
>     at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:110)
>     at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.checkSuccess(BlockReaderRemote.java:440)
>     at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.newBlockReader(BlockReaderRemote.java:408)
>     at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:848)
>     at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:744)
>     at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:379)
>     at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:644)
>     at org.apache.hadoop.hdfs.DFSStripedInputStream.createBlockReader(DFSStripedInputStream.java:264)
>     at org.apache.hadoop.hdfs.StripeReader.readChunk(StripeReader.java:299)
>     at org.apache.hadoop.hdfs.StripeReader.readStripe(StripeReader.java:330)
>     at org.apache.hadoop.hdfs.DFSStripedInputStream.readOneStripe(DFSStripedInputStream.java:326)
>     at org.apache.hadoop.hdfs.DFSStripedInputStream.readWithStrategy(DFSStripedInputStream.java:419)
>     at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829)
>     at java.io.DataInputStream.read(DataInputStream.java:100)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:92)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:66)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:127)
>     at org.apache.hadoop.fs.shell.Display$Cat.printToStdout(Display.java:101)
>     at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:96)
>     at org.apache.hadoop.fs.shell.Command.processPathInternal(Command.java:367)
>     at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:331)
>     at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:304)
>     at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:286)
>     at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:270)
>     at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:119)
>     at org.apache.hadoop.fs.shell.Command.run(Command.java:177)
>     at org.apache.hadoop.fs.FsShell.run(FsShell.java:326)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>     at org.apache.hadoop.fs.FsShell.main(FsShell.java:389)
> 18/10/23 18:12:39 WARN hdfs.DFSClient: Failed to connect to /HOST2_IP:20002 for block BP-1488936467-HOST_IP-1540333392519:blk_-9223372036854774960_1085
> java.io.IOException: Got error, status=ERROR, status message opReadBlock BP-1488936467-HOST_IP-1540333392519:blk_-9223372036854774960_1085 received exception java.io.IOException: Offset 0 and length 116161 don't match block BP-1488936467-HOST_IP-1540333392519:blk_-9223372036854774960_1085 ( blockLen 110296 ), for OP_READ_BLOCK, self=/HOST_IP:48610, remote=/HOST2_IP:20002, for file /user/spark/applicationHistory/application_1540333573846_0003, for pool BP-1488936467-HOST_IP-1540333392519 block -9223372036854774960_1085
>     at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:134)
>     at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:110)
>     at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.checkSuccess(BlockReaderRemote.java:440)
>     at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.newBlockReader(BlockReaderRemote.java:408)
>     at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:848)
>     at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:744)
>     at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:379)
>     at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:644)
>     at org.apache.hadoop.hdfs.DFSStripedInputStream.createBlockReader(DFSStripedInputStream.java:264)
>     at org.apache.hadoop.hdfs.StripeReader.readChunk(StripeReader.java:299)
>     at org.apache.hadoop.hdfs.StripeReader.readStripe(StripeReader.java:330)
>     at org.apache.hadoop.hdfs.DFSStripedInputStream.readOneStripe(DFSStripedInputStream.java:326)
>     at org.apache.hadoop.hdfs.DFSStripedInputStream.readWithStrategy(DFSStripedInputStream.java:419)
>     at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829)
>     at java.io.DataInputStream.read(DataInputStream.java:100)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:92)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:66)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:127)
>     at org.apache.hadoop.fs.shell.Display$Cat.printToStdout(Display.java:101)
>     at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:96)
>     at org.apache.hadoop.fs.shell.Command.processPathInternal(Command.java:367)
>     at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:331)
>     at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:304)
>     at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:286)
>     at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:270)
>     at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:119)
>     at org.apache.hadoop.fs.shell.Command.run(Command.java:177)
>     at org.apache.hadoop.fs.FsShell.run(FsShell.java:326)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>     at org.apache.hadoop.fs.FsShell.main(FsShell.java:389)
> 18/10/23 18:12:39 WARN hdfs.DFSClient: [DatanodeInfoWithStorage[HOST2_IP:20002,DS-f5bc0566-eeb0-43aa-84b9-551a3a6d01a6,DISK]] are unavailable and all striping blocks on them are lost. IgnoredNodes = null
> {"Event":"SparkListenerLogStart","Spark Version":"2.4.0-cdh6.x-SNAPSHOT"}
> {noformat}
> Also, there are clearly {{fsync}} logs in the NN for the file.
> Looking at the code, the only way this can happen is through the {{hsync}}
> overload that takes sync flags on {{DFSStripedOutputStream}}. We should make
> it consistent with the parameterless {{hsync}}. It seems this was simply
> missed in the day-0 implementation in HDFS-7889.
> Credit to [~irashid] for investigating this issue from the Spark side.
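> Until the fix lands, a client can guard its {{hsync}} calls by probing the
> stream's capabilities first. A minimal sketch, assuming the Hadoop 3.x
> {{StreamCapabilities}} API; the {{SyncGuard}}/{{safeSync}} names are
> illustrative, not from the patch:
> {code:java}
> import java.io.IOException;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.StreamCapabilities;
>
> // Client-side workaround sketch, not part of the HDFS-14027 patch.
> final class SyncGuard {
>   static void safeSync(FSDataOutputStream out) throws IOException {
>     // EC (striped) output streams do not advertise the HSYNC capability,
>     // so probe before relying on hsync as a durable sync.
>     if (out.hasCapability(StreamCapabilities.HSYNC)) {
>       out.hsync(); // durable sync on replicated files
>     } else {
>       out.flush(); // best effort for EC files, which do not support hsync
>     }
>   }
> }
> {code}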