[ 
https://issues.apache.org/jira/browse/HADOOP-9654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13688652#comment-13688652
 ] 

Jagane Sundar commented on HADOOP-9654:
---------------------------------------

Roman - pardon me if you already know this and are configuring your BigTop test 
correctly. If you take a look at HDFS-4646 and HDFS-4858, I have observed 
similar failure to timeout issues with both the HDFS Client to NameNode ipc 
(HDFS-4646) and the Datanode to NameNode ipc (HDFS-4858).

By default ipc.client.ping is true. The meaning of this is that the IPC layer 
is to send out a periodic ping but to never timeout.

In order to timeout, ipc.client.ping needs to be configured false and 
ipc.ping.interval needs to be set to some value e.g. 14000. This configuration 
means that the IPC Client should timeout in 14000. Is BigTop configuring hadoop 
so?

                
> IPC timeout doesn't seem to be kicking in
> -----------------------------------------
>
>                 Key: HADOOP-9654
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9654
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 2.1.0-beta
>            Reporter: Roman Shaposhnik
>
> During my Bigtop testing I made the NN OOM. This, in turn, made all of the 
> clients stuck in the IPC call (even the new clients that I run *after* the NN 
> went OOM). Here's an example of a jstack output on the client that was 
> running:
> {noformat}
> $ hadoop fs -lsr /
> {noformat}
> Stacktrace:
> {noformat}
> /usr/java/jdk1.6.0_21/bin/jstack 19078
> 2013-06-19 23:14:00
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (17.0-b16 mixed mode):
> "Attach Listener" daemon prio=10 tid=0x00007fcd8c8c1800 nid=0x5105 waiting on 
> condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
> "IPC Client (1223039541) connection to 
> ip-10-144-82-213.ec2.internal/10.144.82.213:17020 from root" daemon prio=10 
> tid=0x00007fcd8c7ea000 nid=0x4aa0 runnable [0x00007fcd443e2000]
>    java.lang.Thread.State: RUNNABLE
>       at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>       at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
>       at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>       at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>       - locked <0x00007fcd7529de18> (a sun.nio.ch.Util$1)
>       - locked <0x00007fcd7529de00> (a java.util.Collections$UnmodifiableSet)
>       - locked <0x00007fcd7529da80> (a sun.nio.ch.EPollSelectorImpl)
>       at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>       at 
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
>       at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>       at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>       at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>       at java.io.FilterInputStream.read(FilterInputStream.java:116)
>       at java.io.FilterInputStream.read(FilterInputStream.java:116)
>       at 
> org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:421)
>       at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>       at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>       - locked <0x00007fcd752aaf18> (a java.io.BufferedInputStream)
>       at java.io.DataInputStream.readInt(DataInputStream.java:370)
>       at 
> org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:943)
>       at org.apache.hadoop.ipc.Client$Connection.run(Client.java:840)
> "Low Memory Detector" daemon prio=10 tid=0x00007fcd8c090000 nid=0x4a9b 
> runnable [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
> "CompilerThread1" daemon prio=10 tid=0x00007fcd8c08d800 nid=0x4a9a waiting on 
> condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
> "CompilerThread0" daemon prio=10 tid=0x00007fcd8c08a800 nid=0x4a99 waiting on 
> condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
> "Signal Dispatcher" daemon prio=10 tid=0x00007fcd8c088800 nid=0x4a98 runnable 
> [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
> "Finalizer" daemon prio=10 tid=0x00007fcd8c06a000 nid=0x4a97 in Object.wait() 
> [0x00007fcd902e9000]
>    java.lang.Thread.State: WAITING (on object monitor)
>       at java.lang.Object.wait(Native Method)
>       - waiting on <0x00007fcd75fc0470> (a java.lang.ref.ReferenceQueue$Lock)
>       at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
>       - locked <0x00007fcd75fc0470> (a java.lang.ref.ReferenceQueue$Lock)
>       at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
>       at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
> "Reference Handler" daemon prio=10 tid=0x00007fcd8c068000 nid=0x4a96 in 
> Object.wait() [0x00007fcd903ea000]
>    java.lang.Thread.State: WAITING (on object monitor)
>       at java.lang.Object.wait(Native Method)
>       - waiting on <0x00007fcd75fc0550> (a java.lang.ref.Reference$Lock)
>       at java.lang.Object.wait(Object.java:485)
>       at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
>       - locked <0x00007fcd75fc0550> (a java.lang.ref.Reference$Lock)
> "main" prio=10 tid=0x00007fcd8c00a800 nid=0x4a92 in Object.wait() 
> [0x00007fcd91b06000]
>    java.lang.Thread.State: WAITING (on object monitor)
>       at java.lang.Object.wait(Native Method)
>       - waiting on <0x00007fcd752528e8> (a org.apache.hadoop.ipc.Client$Call)
>       at java.lang.Object.wait(Object.java:485)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1284)
>       - locked <0x00007fcd752528e8> (a org.apache.hadoop.ipc.Client$Call)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1250)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:204)
>       at $Proxy9.getFileInfo(Unknown Source)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>       at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>       at $Proxy9.getFileInfo(Unknown Source)
>       at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:649)
>       at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1599)
>       at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:838)
>       at 
> org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1684)
>       at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1630)
>       at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1605)
>       at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:326)
>       at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:224)
>       at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:207)
>       at 
> org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190)
>       at org.apache.hadoop.fs.shell.Command.run(Command.java:154)
>       at org.apache.hadoop.fs.FsShell.run(FsShell.java:255)
>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>       at org.apache.hadoop.fs.FsShell.main(FsShell.java:305)
> "VM Thread" prio=10 tid=0x00007fcd8c064000 nid=0x4a95 runnable 
> "GC task thread#0 (ParallelGC)" prio=10 tid=0x00007fcd8c01d800 nid=0x4a93 
> runnable 
> "GC task thread#1 (ParallelGC)" prio=10 tid=0x00007fcd8c01f800 nid=0x4a94 
> runnable 
> "VM Periodic Task Thread" prio=10 tid=0x00007fcd8c09a800 nid=0x4a9c waiting 
> on condition 
> JNI global references: 1086
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to