[ 
https://issues.apache.org/jira/browse/HDFS-15957?focusedWorklogId=579803&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-579803
 ]

ASF GitHub Bot logged work on HDFS-15957:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 09/Apr/21 06:49
            Start Date: 09/Apr/21 06:49
    Worklog Time Spent: 10m 
      Work Description: hadoop-yetus commented on pull request #2878:
URL: https://github.com/apache/hadoop/pull/2878#issuecomment-816454504


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |  31m 39s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  |  The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  36m 10s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 19s |  |  trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  compile  |   1m 13s |  |  trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   1m  1s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 24s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 53s |  |  trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 27s |  |  trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   3m 17s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  18m 54s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 11s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 17s |  |  the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javac  |   1m 17s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 12s |  |  the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  javac  |   1m 12s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 59s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   1m 12s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 48s |  |  the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 15s |  |  the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   3m 20s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  18m 55s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | -1 :x: |  unit  | 382m  4s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2878/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs in the patch failed.  |
   | +1 :green_heart: |  asflicense  |   0m 37s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 507m 32s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.hdfs.server.namenode.ha.TestBootstrapStandby |
   |   | hadoop.hdfs.TestDFSShell |
   |   | hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA |
   |   | hadoop.hdfs.TestLeaseRecovery |
   |   | hadoop.hdfs.server.namenode.ha.TestEditLogTailer |
   |   | hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList |
   |   | hadoop.hdfs.server.namenode.snapshot.TestNestedSnapshots |
   |   | hadoop.hdfs.server.datanode.TestBlockScanner |
   |   | hadoop.hdfs.TestViewDistributedFileSystemWithMountLinks |
   |   | hadoop.hdfs.server.namenode.TestDecommissioningStatusWithBackoffMonitor |
   |   | hadoop.hdfs.TestSnapshotCommands |
   |   | hadoop.hdfs.server.datanode.TestDirectoryScanner |
   |   | hadoop.hdfs.TestPersistBlocks |
   |   | hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys |
   |   | hadoop.hdfs.TestStateAlignmentContextWithHA |
   |   | hadoop.hdfs.server.namenode.TestDecommissioningStatus |
   |   | hadoop.hdfs.server.diskbalancer.TestDiskBalancer |
   |   | hadoop.hdfs.TestLeaseRecovery2 |
   |   | hadoop.hdfs.server.namenode.TestProcessCorruptBlocks |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2878/1/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/2878 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux ba58ce5c8970 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 794ee7514b273526b8c28feb5f8344b8fe5b9bea |
   | Default Java | Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2878/1/testReport/ |
   | Max. process+thread count | 2036 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2878/1/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 579803)
    Time Spent: 20m  (was: 10m)

> The ignored IOException in the RPC response sent by FSEditLogAsync can cause 
> the HDFS client to hang
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-15957
>                 URL: https://issues.apache.org/jira/browse/HDFS-15957
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: fs async, namenode
>    Affects Versions: 3.2.2
>            Reporter: Haoze Wu
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: fsshell.txt, namenode.txt, reproduce.patch, secondnamenode.txt
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
>     In FSEditLogAsync, the RpcEdit notification at line 248 can be skipped,
> because any exception (e.g., an IOException) thrown at line 365 is silently
> ignored.
>  
> {code:java}
> //hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
> class FSEditLogAsync extends FSEditLog implements Runnable {
>   // ...
>   @Override
>   public void run() {
>     try {
>       while (true) {
>         boolean doSync;
>         Edit edit = dequeueEdit();
>         if (edit != null) {
>           // sync if requested by edit log.
>           doSync = edit.logEdit();
>           syncWaitQ.add(edit);
>         } else {
>           // sync when editq runs dry, but have edits pending a sync.
>           doSync = !syncWaitQ.isEmpty();
>         }
>         if (doSync) {
>           // normally edit log exceptions cause the NN to terminate, but tests
>           // relying on ExitUtil.terminate need to see the exception.
>           RuntimeException syncEx = null;
>           try {
>             logSync(getLastWrittenTxId());
>           } catch (RuntimeException ex) {
>             syncEx = ex;
>           }
>           while ((edit = syncWaitQ.poll()) != null) {
>             edit.logSyncNotify(syncEx);                               // line 248
>           }
>         }
>       }
>     } catch (InterruptedException ie) {
>       LOG.info(Thread.currentThread().getName() + " was interrupted, exiting");
>     } catch (Throwable t) {
>       terminate(t);
>     }
>   }
>   // the calling rpc thread will return immediately from logSync but the
>   // rpc response will not be sent until the edit is durable.
>   private static class RpcEdit extends Edit {
>     // ...
>     @Override
>     public void logSyncNotify(RuntimeException syncEx) {
>       try {
>         if (syncEx == null) {
>           call.sendResponse();                                        // line 365
>         } else {
>           call.abortResponse(syncEx);
>         }
>       } catch (Exception e) {} // don't care if not sent.
>     }
>   }
> }
> {code}
>     `call.sendResponse()` may throw an IOException. According to the comment
> there (“don’t care if not sent”), this exception is neither handled nor
> logged. However, we suspect that some of the RPC responses sent there are
> critical, and that there should be a retry mechanism.
>     We injected a single IOException at line 365 and found that the HDFS
> client (e.g., `bin/hdfs dfs -copyFromLocal ./foo.txt /1.txt`) can get stuck
> forever (hanging for more than 30 minutes without any log output). We can
> reproduce this symptom in multiple ways; one of the simplest is:
>  # Start a new empty HDFS cluster (1 namenode, 2 datanodes) with the default
> configuration.
>  # Generate a 15 MB test file: `fallocate -l 15000000 foo.txt`.
>  # Run the HDFS client: `bin/hdfs dfs -copyFromLocal ./foo.txt /1.txt`.
>  # When line 365 is invoked for the third time (it is invoked 6 times in
> total in this experiment), inject an IOException there. (A patch that injects
> the exception this way is attached for reproducing the issue; a hypothetical
> sketch of the idea follows this list.)
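>     (A rough, hypothetical sketch of the injection idea only; the attached
> reproduce.patch is authoritative, and the counter, threshold, and message
> below are our own, not the patch's:)
> {code:java}
> // Hypothetical fault-injection sketch: fail the 3rd invocation of
> // call.sendResponse() with an IOException, mirroring the experiment above.
> // Assumes java.io.IOException and java.util.concurrent.atomic.AtomicInteger
> // are imported in FSEditLogAsync.java.
> private static final AtomicInteger SEND_RESPONSE_COUNT = new AtomicInteger();
>
> @Override
> public void logSyncNotify(RuntimeException syncEx) {
>   try {
>     if (syncEx == null) {
>       if (SEND_RESPONSE_COUNT.incrementAndGet() == 3) {
>         throw new IOException("injected fault for HDFS-15957 reproduction");
>       }
>       call.sendResponse();                                        // line 365
>     } else {
>       call.abortResponse(syncEx);
>     }
>   } catch (Exception e) {} // unchanged original behavior: don't care if not sent.
> }
> {code}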
>     Then the client hangs forever, without any log output. If we run
> `bin/hdfs dfs -ls /` to check the file status, we cannot see the expected
> 15 MB `/1.txt` file.
>     A jstack of the HDFS client shows an RPC call waiting indefinitely:
> {code:java}
> "Thread-6" #18 daemon prio=5 os_prio=0 tid=0x00007f9cd5295800 nid=0x26b9 in Object.wait() [0x00007f9ca354f000]
>    java.lang.Thread.State: WAITING (on object monitor)
>     at java.lang.Object.wait(Native Method)
>     - waiting on <0x000000071e709610> (a org.apache.hadoop.ipc.Client$Call)
>     at java.lang.Object.wait(Object.java:502)
>     at org.apache.hadoop.util.concurrent.AsyncGet$Util.wait(AsyncGet.java:59)
>     at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1556)
>     - locked <0x000000071e709610> (a org.apache.hadoop.ipc.Client$Call)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1513)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1410)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>     at com.sun.proxy.$Proxy9.addBlock(Unknown Source)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:520)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>     - locked <0x000000071e681d28> (a org.apache.hadoop.io.retry.RetryInvocationHandler$Call)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>     at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
>     at org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1081)
>     at org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1869)
>     at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1671)
>     at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
>
> "IPC Client (440737101) connection to /127.0.0.1:9000 from whz" #16 daemon prio=5 os_prio=0 tid=0x00007f9cd5251800 nid=0x26b3 runnable [0x00007f9ca3752000]
>    java.lang.Thread.State: RUNNABLE
>     at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>     at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>     at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
>     at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>     - locked <0x000000071d243510> (a sun.nio.ch.Util$3)
>     - locked <0x000000071d243488> (a java.util.Collections$UnmodifiableSet)
>     - locked <0x000000071d2430c0> (a sun.nio.ch.EPollSelectorImpl)
>     at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>     at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:339)
>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:158)
>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>     at java.io.FilterInputStream.read(FilterInputStream.java:133)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>     - locked <0x000000071d271828> (a java.io.BufferedInputStream)
>     at java.io.FilterInputStream.read(FilterInputStream.java:83)
>     at java.io.FilterInputStream.read(FilterInputStream.java:83)
>     at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:564)
>     at java.io.DataInputStream.readInt(DataInputStream.java:387)
>     at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1886)
>     at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1195)
>     at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1090)
>
> "main" #1 prio=5 os_prio=0 tid=0x00007f9cd4019800 nid=0x269a in Object.wait() [0x00007f9cdce5a000]
>    java.lang.Thread.State: WAITING (on object monitor)
>     at java.lang.Object.wait(Native Method)
>     - waiting on <0x000000071e179400> (a java.util.LinkedList)
>     at java.lang.Object.wait(Object.java:502)
>     at org.apache.hadoop.hdfs.DataStreamer.waitAndQueuePacket(DataStreamer.java:930)
>     - locked <0x000000071e179400> (a java.util.LinkedList)
>     at org.apache.hadoop.hdfs.DFSOutputStream.enqueueCurrentPacket(DFSOutputStream.java:483)
>     at org.apache.hadoop.hdfs.DFSOutputStream.enqueueCurrentPacketFull(DFSOutputStream.java:492)
>     - locked <0x000000071e01fb48> (a org.apache.hadoop.hdfs.DFSOutputStream)
>     at org.apache.hadoop.hdfs.DFSOutputStream.writeChunk(DFSOutputStream.java:433)
>     - locked <0x000000071e01fb48> (a org.apache.hadoop.hdfs.DFSOutputStream)
>     at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunks(FSOutputSummer.java:217)
>     at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:125)
>     at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:111)
>     - locked <0x000000071e01fb48> (a org.apache.hadoop.hdfs.DFSOutputStream)
>     at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:57)
>     at java.io.DataOutputStream.write(DataOutputStream.java:107)
>     - locked <0x000000071e267270> (a org.apache.hadoop.hdfs.client.HdfsDataOutputStream)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:96)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:68)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:129)
>     at org.apache.hadoop.fs.shell.CommandWithDestination$TargetFileSystem.writeStreamToFile(CommandWithDestination.java:485)
>     at org.apache.hadoop.fs.shell.CommandWithDestination.copyStreamToTarget(CommandWithDestination.java:407)
>     at org.apache.hadoop.fs.shell.CommandWithDestination.copyFileToTarget(CommandWithDestination.java:342)
>     at org.apache.hadoop.fs.shell.CopyCommands$Put.copyFile(CopyCommands.java:335)
>     at org.apache.hadoop.fs.shell.CopyCommands$Put.copyFileToTarget(CopyCommands.java:343)
>     at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:277)
>     at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:262)
>     at org.apache.hadoop.fs.shell.Command.processPathInternal(Command.java:367)
>     at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:331)
>     at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:304)
>     at org.apache.hadoop.fs.shell.CommandWithDestination.processPathArgument(CommandWithDestination.java:257)
>     at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:286)
>     at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:270)
>     at org.apache.hadoop.fs.shell.CommandWithDestination.processArguments(CommandWithDestination.java:228)
>     at org.apache.hadoop.fs.shell.CopyCommands$Put.processArguments(CopyCommands.java:303)
>     at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:120)
>     at org.apache.hadoop.fs.shell.Command.run(Command.java:177)
>     at org.apache.hadoop.fs.FsShell.run(FsShell.java:327)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>     at org.apache.hadoop.fs.FsShell.main(FsShell.java:390)
> {code}
>     The jstack for the namenode shows:
> {code:java}
> "FSEditLogAsync" #44 prio=5 os_prio=0 tid=0x00007f1af5802800 nid=0x45f0 waiting on condition [0x00007f1ac3090000]
>    java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x00000005d6b1e198> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>   at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:403)
>   at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.dequeueEdit(FSEditLogAsync.java:221)
>   at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.run(FSEditLogAsync.java:230)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>     The full jstack files for the client (fsshell), namenode, and secondary
> namenode are attached. We are not sure about the exact causal chain behind
> the client hang. However, we found that adding retry logic to the response
> sending fixes the hang, and we will raise a pull request based on this retry
> logic.
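>     A minimal sketch of such retry logic, assuming a fixed retry bound (the
> bound and the warning message below are illustrative assumptions; the actual
> change in PR #2878 may differ):
> {code:java}
> // Hedged sketch: retry call.sendResponse() a bounded number of times so a
> // transient IOException no longer drops the RPC response silently.
> @Override
> public void logSyncNotify(RuntimeException syncEx) {
>   final int maxAttempts = 3; // hypothetical bound, not from the patch
>   for (int attempt = 1; attempt <= maxAttempts; attempt++) {
>     try {
>       if (syncEx == null) {
>         call.sendResponse();
>       } else {
>         call.abortResponse(syncEx);
>       }
>       return; // response delivered or aborted; done
>     } catch (IOException ioe) {
>       // previously swallowed silently; now log and retry
>       LOG.warn("Failed to send RPC response, attempt " + attempt
>           + " of " + maxAttempts, ioe);
>     } catch (Exception e) {
>       return; // keep the old best-effort behavior for non-IO failures
>     }
>   }
> }
> {code}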
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
