[ 
https://issues.apache.org/jira/browse/STORM-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rick Kellogg updated STORM-1023:
--------------------------------
    Component/s: storm-core

> Nimbus server hogs 100% CPU and clients are stuck 
> --------------------------------------------------
>
>                 Key: STORM-1023
>                 URL: https://issues.apache.org/jira/browse/STORM-1023
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>    Affects Versions: 0.9.3, 0.10.0, 0.9.4, 0.11.0, 0.9.5, 0.9.6
>         Environment: Storm 0.9.5 / thrift 0.7
>            Reporter: Frantz Mazoyer
>
> The test environment is Storm 0.9.5 with Thrift Java 0.7.
> Test scenario:
>   Deploy a Storm topology in a loop.
>   When the Nimbus cleanup timeout is reached, the Thrift server throws an
>   error:
>   "Exception while invoking ..." ... TException
> Test result:
>   The Thrift Java server in Nimbus spins at 100% CPU in an infinite loop in:
> jstack:
> {code}
> "Thread-5" prio=10 tid=0x00007fb134aab800 nid=0x6767 runnable [0x00007fb129c9b000]
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>         at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
>         ...
>         at org.apache.thrift7.server.TNonblockingServer$SelectThread.select(TNonblockingServer.java:284)
> {code}
> strace:
> {code}
> epoll_wait(70, {{EPOLLIN, {u32=866, u64=866}}, {EPOLLIN, {u32=876, u64=876}}}, 4096, 4294967295) = 2
> {code}
> Investigation and tests show the following.
> Any exception thrown during processor execution bypasses the call to
> {code} responseReady() {code}, so the size of the request buffer is never
> subtracted from the counter:
> {code} readBufferBytesAllocated.addAndGet(-buffer_.array().length); {code}
> After enough failed requests, the counter approaches its maximum value,
> MAX_READ_BUFFER_BYTES, and every subsequent request is delayed forever
> because the following test in {code} read() {code} is always true:
> {code} if (readBufferBytesAllocated.get() + frameSize > MAX_READ_BUFFER_BYTES) {code}
> In the end, the server thread loops in select(), which wakes up immediately
> for read() because the content of the socket was never drained.
> The thread cycles forever between select() and the read() test above,
> consuming 100% CPU.
> In addition, all client requests hang forever.
> Example of a failed request:
> {code}
> 2015-09-01T12:19:35.954+0200 b.s.d.nimbus [WARN] Topology submission exception. (topology name='mytopology') #<IllegalArgumentException java.lang.IllegalArgumentException: /opt/SPE/share/storm/storm/local/nimbus/inbox/stormjar-3f8f3ba7-5420-4773-af24-bfa294cceb79.jar to copy to /opt/SPE/share/storm/storm/local/nimbus/stormdist/mytopology-87-1441102775 does not exist!>
> 2015-09-01T12:19:35.955+0200 o.a.t.s.TNonblockingServer [ERROR] Unexpected exception while invoking!
> java.lang.IllegalArgumentException: /opt/SPE/share/storm/storm/local/nimbus/inbox/stormjar-3f8f3ba7-5420-4773-af24-bfa294cceb79.jar to copy to /opt/SPE/share/storm/storm/local/nimbus/stormdist/mytopology-87-1441102775 does not exist!
>         at backtype.storm.daemon.nimbus$fn__3827.invoke(nimbus.clj:1173) ~[storm-core-0.9.5.jar:0.9.5]
>         at clojure.lang.MultiFn.invoke(MultiFn.java:236) ~[clojure-1.5.1.jar:na]
>         at backtype.storm.daemon.nimbus$setup_storm_code.invoke(nimbus.clj:307) ~[storm-core-0.9.5.jar:0.9.5]
>         at backtype.storm.daemon.nimbus$fn__3724$exec_fn__1103__auto__$reify__3737.submitTopologyWithOpts(nimbus.clj:953) ~[storm-core-0.9.5.jar:0.9.5]
>         at backtype.storm.daemon.nimbus$fn__3724$exec_fn__1103__auto__$reify__3737.submitTopology(nimbus.clj:966) ~[storm-core-0.9.5.jar:0.9.5]
>         at backtype.storm.generated.Nimbus$Processor$submitTopology.getResult(Nimbus.java:1240) ~[storm-core-0.9.5.jar:0.9.5]
>         at backtype.storm.generated.Nimbus$Processor$submitTopology.getResult(Nimbus.java:1228) ~[storm-core-0.9.5.jar:0.9.5]
>         at org.apache.thrift7.ProcessFunction.process(ProcessFunction.java:32) ~[storm-core-0.9.5.jar:0.9.5]
>         at org.apache.thrift7.TBaseProcessor.process(TBaseProcessor.java:34) ~[storm-core-0.9.5.jar:0.9.5]
>         at org.apache.thrift7.server.TNonblockingServer$FrameBuffer.invoke(TNonblockingServer.java:632) ~[storm-core-0.9.5.jar:0.9.5]
>         at org.apache.thrift7.server.THsHaServer$Invocation.run(THsHaServer.java:201) [storm-core-0.9.5.jar:0.9.5]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_75]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_75]
>         at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
> {code} 
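The accounting leak described in the report can be reproduced in isolation. The following is a minimal sketch, not the Thrift source: the class `ReadBufferAccounting` and its methods `tryAdmit`, `invokeLeaky`, `invokeGuarded`, and `simulate` are hypothetical names, and the model tracks the frame size rather than `buffer_.array().length`. It only illustrates how a skipped decrement eventually makes the admission test in read() permanently true, and how a `finally` block is one possible way to keep the counter balanced.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Simplified, hypothetical model of the read-buffer accounting; not Thrift code. */
public class ReadBufferAccounting {
    private final long maxReadBufferBytes;
    private final AtomicLong readBufferBytesAllocated = new AtomicLong(0);

    public ReadBufferAccounting(long maxReadBufferBytes) {
        this.maxReadBufferBytes = maxReadBufferBytes;
    }

    /** Same shape as the test quoted from read(): refuse a frame that would exceed the cap. */
    public boolean tryAdmit(int frameSize) {
        if (readBufferBytesAllocated.get() + frameSize > maxReadBufferBytes) {
            return false; // data stays on the socket, so the selector fires again at once
        }
        readBufferBytesAllocated.addAndGet(frameSize);
        return true;
    }

    /** Leaky variant: the decrement is skipped whenever the handler throws. */
    public void invokeLeaky(int frameSize, Runnable handler) {
        handler.run();
        readBufferBytesAllocated.addAndGet(-frameSize);
    }

    /** Guarded variant: the decrement runs whether or not the handler throws. */
    public void invokeGuarded(int frameSize, Runnable handler) {
        try {
            handler.run();
        } finally {
            readBufferBytesAllocated.addAndGet(-frameSize);
        }
    }

    /** Sends `requests` always-failing requests and returns how many were admitted. */
    public static int simulate(boolean guarded, int requests, int frameSize, long cap) {
        ReadBufferAccounting acct = new ReadBufferAccounting(cap);
        Runnable failing = () -> {
            throw new IllegalArgumentException("simulated handler failure");
        };
        int admitted = 0;
        for (int i = 0; i < requests; i++) {
            if (!acct.tryAdmit(frameSize)) {
                continue; // the real server would spin between select() and read() here
            }
            admitted++;
            try {
                if (guarded) {
                    acct.invokeGuarded(frameSize, failing);
                } else {
                    acct.invokeLeaky(frameSize, failing);
                }
            } catch (RuntimeException e) {
                // The server logs the failure and keeps serving, as in the log above.
            }
        }
        return admitted;
    }

    public static void main(String[] args) {
        System.out.println("leaky admitted:   " + simulate(false, 10, 30, 100));
        System.out.println("guarded admitted: " + simulate(true, 10, 30, 100));
    }
}
```

With a cap of 100 bytes and 30-byte frames, the leaky variant admits only the first three failing requests before every later one is refused, while the guarded variant admits all ten.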



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
