[ https://issues.apache.org/jira/browse/FLINK-13228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16891681#comment-16891681 ]
Yu Li commented on FLINK-13228:
-------------------------------
From the current investigation, the weirdest part is that, according to the Travis log, there is an additional data flush in {{DataStreamer}}, as indicated below:
{noformat}
23:31:07,549 INFO org.apache.hadoop.hdfs.StateChange - DIR* completeFile: /tests/qoadqaraomqogyax/.part-0.inprogress.748c58f2-4db8-4b2f-a10a-e5ea09410b7e is closed by DFSClient_NONMAPREDUCE_-1957051875_1
23:31:07,552 WARN org.apache.hadoop.hdfs.DataStreamer - DataStreamer Exception
java.nio.channels.ClosedByInterruptException
    at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:478)
    at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at java.io.DataOutputStream.flush(DataOutputStream.java:123)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:766)
13:33:59,846 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit - allowed=true ugi=travis (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/tests/fsledeprirhvkkht/.part-0.inprogress.fd9bc50d-96aa-48bc-86e1-8662382b53c9 dst=null perm=null proto=rpc
13:33:59,848 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit - allowed=true ugi=travis (auth:SIMPLE) ip=/127.0.0.1 cmd=rename src=/tests/fsledeprirhvkkht/.part-0.inprogress.fd9bc50d-96aa-48bc-86e1-8662382b53c9 dst=/tests/fsledeprirhvkkht/part-0 perm=travis:supergroup:rw-r--r-- proto=rpc
13:33:59,849 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit - allowed=true ugi=travis (auth:SIMPLE) ip=/127.0.0.1 cmd=listStatus src=/tests/fsledeprirhvkkht dst=null perm=null proto=rpc
13:33:59,850 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit - allowed=true ugi=travis (auth:SIMPLE) ip=/127.0.0.1 cmd=open src=/tests/fsledeprirhvkkht/part-0 dst=null perm=null proto=rpc
13:33:59,855 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit - allowed=true ugi=travis (auth:SIMPLE) ip=/127.0.0.1 cmd=delete src=/tests/fsledeprirhvkkht dst=null perm=null proto=rpc
{noformat}
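As a side note on the {{ClosedByInterruptException}} itself: it is the standard JDK behaviour when a thread is interrupted while blocked in a write on an interruptible NIO channel, so the trace tells us the {{DataStreamer}} thread was interrupted in the middle of a socket write to the datanode. A minimal, self-contained sketch (plain JDK, no Flink or Hadoop code involved) that reproduces the same exception type:
{code:java}
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class InterruptedWriteDemo {
    public static void main(String[] args) throws Exception {
        // A server that accepts the connection but never reads, so the client's write blocks.
        ServerSocketChannel server =
                ServerSocketChannel.open().bind(new InetSocketAddress("127.0.0.1", 0));

        Thread writer = new Thread(() -> {
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress())) {
                ByteBuffer chunk = ByteBuffer.allocate(64 * 1024);
                while (true) {       // keep writing until the socket buffers fill and write() blocks
                    chunk.clear();
                    client.write(chunk);
                }
            } catch (ClosedByInterruptException e) {
                // Same exception type as logged by the DataStreamer thread above.
                System.out.println("writer got ClosedByInterruptException");
            } catch (Exception e) {
                e.printStackTrace();
            }
        });

        writer.start();
        server.accept();             // accept but never read
        Thread.sleep(1000);          // let the writer block inside SocketChannel.write()
        writer.interrupt();          // interrupting a blocked channel write closes the channel
        writer.join();
        server.close();
    }
}
{code}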
However, when debugging with the attached Hadoop patch, we could confirm that the data is split into two packets in our test case, and if either of them fails to write, we see the exception below:
{noformat}
[DataStreamer for file /tests/yufomllbjxlhztth/.part-0.inprogress.fca749ce-ce9d-4fef-9f08-3937358b4c05 block BP-1698089270-127.0.0.1-1563951866468:blk_1073741825_1001] WARN org.apache.hadoop.hdfs.DataStreamer - DataStreamer Exception
java.nio.channels.ClosedByInterruptException
    at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:77)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:179)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:137)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at java.io.DataOutputStream.flush(DataOutputStream.java:123)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:766)
[DataXceiver for client DFSClient_NONMAPREDUCE_-1981706570_1 at /127.0.0.1:51159 [Receiving block BP-1698089270-127.0.0.1-1563951866468:blk_1073741825_1001]] INFO org.apache.hadoop.hdfs.server.datanode.DataNode - Exception for BP-1698089270-127.0.0.1-1563951866468:blk_1073741825_1001
java.io.IOException: Premature EOF from inputStream
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:208)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:211)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:521)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:923)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:854)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:166)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:288)
    at java.lang.Thread.run(Thread.java:745)
Test testCommitAfterNormalClose(org.apache.flink.runtime.fs.hdfs.HadoopRecoverableWriterTest) failed with:
java.io.IOException: All datanodes [DatanodeInfoWithStorage[127.0.0.1:51147,DS-8f7fb1af-ce7d-4547-b50b-a53f00c40408,DISK]] are bad. Aborting...
    at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1530)
    at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1465)
    at org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1237)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:657)
{noformat}
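On the two-packet split mentioned above: the split itself is expected whenever the payload exceeds the DFS client write packet size ({{dfs.client-write-packet-size}}, 64 KB by default). The sketch below is not the actual {{HadoopRecoverableWriterTest}}, only an illustration of a write against a {{MiniDFSCluster}} that the {{DataStreamer}} has to ship as more than one packet; the path and payload size are made up:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class MultiPacketWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(1).build();
        try {
            FileSystem fs = cluster.getFileSystem();
            Path file = new Path("/tests/example/.part-0.inprogress");

            try (FSDataOutputStream out = fs.create(file)) {
                // 128 KB exceeds the default client packet size (dfs.client-write-packet-size,
                // 64 KB), so the DataStreamer must split the payload into at least two packets.
                out.write(new byte[128 * 1024]);
                out.hflush();   // push the queued packets through the pipeline
            }                   // close() sends the empty "last packet" and completes the file
        } finally {
            cluster.shutdown();
        }
    }
}
{code}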
But if we throw {{ClosedByInterruptException}} only after the two packets have completed, no additional flush happens in {{DataStreamer}}, as indicated by the log below:
{noformat}
24794 [PacketResponder: BP-695740659-127.0.0.1-1563954445528:blk_1073741825_1001, type=LAST_IN_PIPELINE] INFO org.apache.hadoop.hdfs.server.datanode.DataNode - PacketResponder: BP-695740659-127.0.0.1-1563954445528:blk_1073741825_1001, type=LAST_IN_PIPELINE terminating
24795 [IPC Server handler 4 on 53181] INFO org.apache.hadoop.hdfs.StateChange - DIR* completeFile: /tests/ujoamxgnvmlapclr/.part-0.inprogress.e3bcef26-5811-495b-961b-1c65b20acccf is closed by DFSClient_NONMAPREDUCE_1297545335_1
24796 [IPC Server handler 5 on 53181] INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit - allowed=true ugi=jueding (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/tests/ujoamxgnvmlapclr/.part-0.inprogress.e3bcef26-5811-495b-961b-1c65b20acccf dst=null perm=null proto=rpc
24797 [IPC Server handler 6 on 53181] INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit - allowed=true ugi=jueding (auth:SIMPLE) ip=/127.0.0.1 cmd=rename src=/tests/ujoamxgnvmlapclr/.part-0.inprogress.e3bcef26-5811-495b-961b-1c65b20acccf dst=/tests/ujoamxgnvmlapclr/part-0 perm=jueding:supergroup:rw-r--r-- proto=rpc
24798 [IPC Server handler 7 on 53181] INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit - allowed=true ugi=jueding (auth:SIMPLE) ip=/127.0.0.1 cmd=listStatus src=/tests/ujoamxgnvmlapclr dst=null perm=null proto=rpc
24799 [IPC Server handler 8 on 53181] INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit - allowed=true ugi=jueding (auth:SIMPLE) ip=/127.0.0.1 cmd=open src=/tests/ujoamxgnvmlapclr/part-0 dst=null perm=null proto=rpc
{noformat}
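To make the timing difference concrete, here is a purely hypothetical sketch of what such an injection hook could look like; it is not the content of FLINK-13228.hadoop.debug.patch, and the class and method names ({{PacketFaultInjector}}, {{failAfterPackets}}, {{onPacketSent}}) are made up for illustration:
{code:java}
import java.nio.channels.ClosedByInterruptException;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical illustration only; the real instrumentation lives in the attached debug patch,
 * whose exact contents are not reproduced here. The idea is a hook, called from the packet send
 * path, that throws ClosedByInterruptException either while a packet is still outstanding or
 * only after the last packet has gone out.
 */
public final class PacketFaultInjector {

    private static volatile int failAfterNPackets = Integer.MAX_VALUE;
    private static final AtomicInteger packetsSent = new AtomicInteger();

    private PacketFaultInjector() {
    }

    /** Arrange for the fault to fire once {@code n} packets have been sent. */
    public static void failAfterPackets(int n) {
        failAfterNPackets = n;
        packetsSent.set(0);
    }

    /** Called from the instrumented send path right after a packet goes out. */
    public static void onPacketSent() throws ClosedByInterruptException {
        if (packetsSent.incrementAndGet() == failAfterNPackets) {
            throw new ClosedByInterruptException();
        }
    }
}
{code}
With a hook of this shape, {{failAfterPackets(1)}} would correspond to the failed-write run above (premature EOF on the datanode and the "All datanodes ... are bad" abort), while {{failAfterPackets(2)}} corresponds to the clean completeFile/rename sequence in the log just above.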
I will try to figure out where the additional flush comes from, and I believe we can reproduce and resolve the issue once it is located.
> HadoopRecoverableWriterTest.testCommitAfterNormalClose fails on Travis
> ----------------------------------------------------------------------
>
> Key: FLINK-13228
> URL: https://issues.apache.org/jira/browse/FLINK-13228
> Project: Flink
> Issue Type: Bug
> Components: FileSystems
> Affects Versions: 1.9.0
> Reporter: Till Rohrmann
> Assignee: Yu Li
> Priority: Critical
> Labels: test-stability
> Fix For: 1.9.0
>
> Attachments: FLINK-13228.hadoop.debug.patch
>
>
> {{HadoopRecoverableWriterTest.testCommitAfterNormalClose}} failed on Travis
> with
> {code}
> HadoopRecoverableWriterTest.testCommitAfterNormalClose » IO The stream is
> closed
> {code}
> https://api.travis-ci.org/v3/job/557293706/log.txt