[ https://issues.apache.org/jira/browse/FLINK-13228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16891681#comment-16891681 ]

Yu Li commented on FLINK-13228:
-------------------------------

From the current investigation, the weirdest part is that, according to the Travis log, 
there is an additional data flush in {{DataStreamer}} as indicated below:
{noformat}
23:31:07,549 INFO  org.apache.hadoop.hdfs.StateChange                           
 - DIR* completeFile: 
/tests/qoadqaraomqogyax/.part-0.inprogress.748c58f2-4db8-4b2f-a10a-e5ea09410b7e 
is closed by DFSClient_NONMAPREDUCE_-1957051875_1
23:31:07,552 WARN  org.apache.hadoop.hdfs.DataStreamer                          
 - DataStreamer Exception
java.nio.channels.ClosedByInterruptException
        at 
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:478)
        at 
org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
        at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at 
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
        at 
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
        at 
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at java.io.DataOutputStream.flush(DataOutputStream.java:123)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:766)
13:33:59,846 INFO  org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit    
 - allowed=true ugi=travis (auth:SIMPLE)        ip=/127.0.0.1   cmd=getfileinfo 
src=/tests/fsledeprirhvkkht/.part-0.inprogress.fd9bc50d-96aa-48bc-86e1-8662382b53c9
     dst=null        perm=null       proto=rpc
13:33:59,848 INFO  org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit    
 - allowed=true ugi=travis (auth:SIMPLE)        ip=/127.0.0.1   cmd=rename      
src=/tests/fsledeprirhvkkht/.part-0.inprogress.fd9bc50d-96aa-48bc-86e1-8662382b53c9
     dst=/tests/fsledeprirhvkkht/part-0      perm=travis:supergroup:rw-r--r--   
     proto=rpc
13:33:59,849 INFO  org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit    
 - allowed=true ugi=travis (auth:SIMPLE)        ip=/127.0.0.1   cmd=listStatus  
src=/tests/fsledeprirhvkkht     dst=null        perm=null       proto=rpc
13:33:59,850 INFO  org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit    
 - allowed=true ugi=travis (auth:SIMPLE)        ip=/127.0.0.1   cmd=open        
src=/tests/fsledeprirhvkkht/part-0      dst=null        perm=null       
proto=rpc
13:33:59,855 INFO  org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit    
 - allowed=true ugi=travis (auth:SIMPLE)        ip=/127.0.0.1   cmd=delete      
src=/tests/fsledeprirhvkkht     dst=null        perm=null       proto=rpc

{noformat}
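For clarity on the exception itself: {{ClosedByInterruptException}} is what NIO throws when the thread performing I/O on an interruptible channel is interrupted, and the interrupt also closes the channel. Below is a minimal, self-contained JDK-only sketch (not Flink or Hadoop code) that reproduces the same exception by interrupting a thread blocked in {{SocketChannel#write}}, which is essentially what happens to the {{DataStreamer}} thread during the extra flush seen in the Travis log above:

{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class InterruptedWriteDemo {
    public static void main(String[] args) throws Exception {
        // Local server that accepts the connection but never reads,
        // so the client write eventually blocks.
        ServerSocketChannel server =
                ServerSocketChannel.open().bind(new InetSocketAddress("127.0.0.1", 0));

        Thread writer = new Thread(() -> {
            try (SocketChannel ch = SocketChannel.open(server.getLocalAddress())) {
                ByteBuffer buf = ByteBuffer.allocate(1 << 20);
                while (true) {
                    ch.write(buf);   // blocks once the socket buffers are full
                    buf.clear();
                }
            } catch (ClosedByInterruptException e) {
                // Interrupting the writing thread closes the channel and surfaces here,
                // just like the DataStreamer stack trace in the Travis log above.
                System.out.println("caught " + e);
            } catch (IOException e) {
                e.printStackTrace();
            }
        });

        writer.start();
        server.accept();      // complete the connection but never read from it
        Thread.sleep(1000);   // give the writer time to block in write()
        writer.interrupt();   // -> ClosedByInterruptException in the writer thread
        writer.join();
    }
}
{code}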

However, when debugging with the attached Hadoop patch, we could confirm that the data 
is split into two packets in our test case, and that if either of them fails to write, 
we see the exception below:
{noformat}
[DataStreamer for file 
/tests/yufomllbjxlhztth/.part-0.inprogress.fca749ce-ce9d-4fef-9f08-3937358b4c05 
block BP-1698089270-127.0.0.1-1563951866468:blk_1073741825_1001] WARN  
org.apache.hadoop.hdfs.DataStreamer  - DataStreamer Exception
java.nio.channels.ClosedByInterruptException
        at 
org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:77)
        at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at 
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:179)
        at 
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:137)
        at 
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at java.io.DataOutputStream.flush(DataOutputStream.java:123)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:766)
[DataXceiver for client DFSClient_NONMAPREDUCE_-1981706570_1 at 
/127.0.0.1:51159 [Receiving block 
BP-1698089270-127.0.0.1-1563951866468:blk_1073741825_1001]] INFO  
org.apache.hadoop.hdfs.server.datanode.DataNode  - Exception for 
BP-1698089270-127.0.0.1-1563951866468:blk_1073741825_1001
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:208)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:211)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:521)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:923)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:854)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:166)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:288)
        at java.lang.Thread.run(Thread.java:745)
Test 
testCommitAfterNormalClose(org.apache.flink.runtime.fs.hdfs.HadoopRecoverableWriterTest)
 failed with:
java.io.IOException: All datanodes 
[DatanodeInfoWithStorage[127.0.0.1:51147,DS-8f7fb1af-ce7d-4547-b50b-a53f00c40408,DISK]]
 are bad. Aborting...
        at 
org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1530)
        at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1465)
        at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1237)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:657)
{noformat}
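For reference, one straightforward way data ends up split across two packets is simply its size relative to the HDFS client packet size. The sketch below is only an illustration (not the actual test code), assuming the default {{dfs.client-write-packet-size}} of 64KB and a hypothetical output path; writing anything larger than one packet forces {{DataStreamer}} to send at least two packets before {{close()}} can complete the file:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TwoPacketWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default client write packet size is 64KB; payloads larger than this are
        // split into at least two packets by the DataStreamer.
        int packetSize = conf.getInt("dfs.client-write-packet-size", 64 * 1024);
        byte[] data = new byte[packetSize + 1024];  // forces a second packet

        FileSystem fs = FileSystem.get(conf);
        // Hypothetical path, only for illustration.
        Path path = new Path("/tests/example/.part-0.inprogress");
        try (FSDataOutputStream out = fs.create(path)) {
            out.write(data);
            // close() flushes the remaining packet(s) and completes the file on the
            // NameNode, which corresponds to the "DIR* completeFile" log line.
        }
    }
}
{code}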

But if we throw {{ClosedByInterruptException}} after the two packets complete, no 
additional flush happens in {{DataStreamer}}, as indicated by the log below:
{noformat}
24794 [PacketResponder: 
BP-695740659-127.0.0.1-1563954445528:blk_1073741825_1001, 
type=LAST_IN_PIPELINE] INFO  org.apache.hadoop.hdfs.server.datanode.DataNode  - 
PacketResponder: BP-695740659-127.0.0.1-1563954445528:blk_1073741825_1001, 
type=LAST_IN_PIPELINE terminating
24795 [IPC Server handler 4 on 53181] INFO  org.apache.hadoop.hdfs.StateChange  
- DIR* completeFile: 
/tests/ujoamxgnvmlapclr/.part-0.inprogress.e3bcef26-5811-495b-961b-1c65b20acccf 
is closed by DFSClient_NONMAPREDUCE_1297545335_1
24796 [IPC Server handler 5 on 53181] INFO  
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit  - allowed=true   
ugi=jueding (auth:SIMPLE)       ip=/127.0.0.1   cmd=getfileinfo 
src=/tests/ujoamxgnvmlapclr/.part-0.inprogress.e3bcef26-5811-495b-961b-1c65b20acccf
     dst=null        perm=null       proto=rpc
24797 [IPC Server handler 6 on 53181] INFO  
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit  - allowed=true   
ugi=jueding (auth:SIMPLE)       ip=/127.0.0.1   cmd=rename      
src=/tests/ujoamxgnvmlapclr/.part-0.inprogress.e3bcef26-5811-495b-961b-1c65b20acccf
     dst=/tests/ujoamxgnvmlapclr/part-0      perm=jueding:supergroup:rw-r--r--  
     proto=rpc
24798 [IPC Server handler 7 on 53181] INFO  
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit  - allowed=true   
ugi=jueding (auth:SIMPLE)       ip=/127.0.0.1   cmd=listStatus  
src=/tests/ujoamxgnvmlapclr     dst=null        perm=null       proto=rpc
24799 [IPC Server handler 8 on 53181] INFO  
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit  - allowed=true   
ugi=jueding (auth:SIMPLE)       ip=/127.0.0.1   cmd=open        
src=/tests/ujoamxgnvmlapclr/part-0      dst=null        perm=null       
proto=rpc
{noformat}
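To make the two scenarios above easier to compare, here is a purely hypothetical sketch of the kind of packet-boundary fault injection the attached {{FLINK-13228.hadoop.debug.patch}} enables (the real patch lives in {{DataStreamer}}; the class, method names and counter below are made up for illustration only):

{code:java}
import java.nio.channels.ClosedByInterruptException;

/** Hypothetical illustration only; NOT the actual FLINK-13228.hadoop.debug.patch. */
class PacketFaultInjector {
    private int completedPackets = 0;

    /** Imagine this being called each time a packet has been fully written and acked. */
    void packetCompleted() {
        completedPackets++;
    }

    /**
     * Imagine this being called right before the next write. Interrupting while
     * completedPackets < 2 kills the write mid-stream ("All datanodes ... are bad");
     * interrupting only once both packets are done matches the clean close in the
     * log above, with no additional flush in DataStreamer.
     */
    void maybeInterrupt(int interruptAfterPackets) throws ClosedByInterruptException {
        if (completedPackets >= interruptAfterPackets) {
            throw new ClosedByInterruptException();
        }
    }
}
{code}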

I will try to figure out where the additional flush comes from, and I believe we can 
reproduce and resolve the issue once it is located.

> HadoopRecoverableWriterTest.testCommitAfterNormalClose fails on Travis
> ----------------------------------------------------------------------
>
>                 Key: FLINK-13228
>                 URL: https://issues.apache.org/jira/browse/FLINK-13228
>             Project: Flink
>          Issue Type: Bug
>          Components: FileSystems
>    Affects Versions: 1.9.0
>            Reporter: Till Rohrmann
>            Assignee: Yu Li
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.9.0
>
>         Attachments: FLINK-13228.hadoop.debug.patch
>
>
> {{HadoopRecoverableWriterTest.testCommitAfterNormalClose}} failed on Travis 
> with
> {code}
> HadoopRecoverableWriterTest.testCommitAfterNormalClose » IO The stream is 
> closed
> {code}
> https://api.travis-ci.org/v3/job/557293706/log.txt


