[
https://issues.apache.org/jira/browse/STORM-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Frantz Mazoyer updated STORM-1023:
----------------------------------
Labels: (was: newbie)
Description:
The test environment is Storm 0.9.5 with Thrift Java 0.7.
Test scenario:
Deploy a Storm topology in a loop.
When the nimbus cleanup timeout is reached, the thrift server throws an error:
"Exception while invoking ..." ... TException
Test result:
The Thrift Java server in nimbus goes to 100% CPU in an infinite loop, as shown by:
jstack:
{code}
"Thread-5" prio=10 tid=0x00007fb134aab800 nid=0x6767 runnable [0x00007fb129c9b000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
        ...
        at org.apache.thrift7.server.TNonblockingServer$SelectThread.select(TNonblockingServer.java:284)
{code}
strace:
{code}
epoll_wait(70, {{EPOLLIN, {u32=866, u64=866}}, {EPOLLIN, {u32=876, u64=876}}}, 4096, 4294967295) = 2
{code}
Investigation and tests show that:
Any exception thrown during processor execution bypasses the call to
{code} responseReady() {code}, so the statement {code}
readBufferBytesAllocated.addAndGet(-buffer_.array().length); {code} never runs
and the counter is not decremented by the size of the request buffer.
After enough failed requests, this counter approaches the maximum value
MAX_READ_BUFFER_BYTES, and every subsequent request is delayed forever
because the following test in {code} read() {code}:
{code} if (readBufferBytesAllocated.get() + frameSize >
MAX_READ_BUFFER_BYTES) {code} is always true.
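
The accounting problem described above can be modeled in a few lines of Java. This is a simplified sketch, not the Thrift source: the class ReadBufferAccounting, its methods, and the 1024-byte limit are hypothetical, and only the names readBufferBytesAllocated and MAX_READ_BUFFER_BYTES and the shape of the admission test are taken from the description. It contrasts a release that is skipped when the processor throws with one possible remedy, a release placed in a finally block.

```java
import java.util.concurrent.atomic.AtomicLong;

final class ReadBufferAccounting {
    // Hypothetical small limit so the effect shows up after a few requests.
    static final long MAX_READ_BUFFER_BYTES = 1024;
    static final AtomicLong readBufferBytesAllocated = new AtomicLong();

    // Same shape as the admission test quoted from read().
    static boolean tryReserve(int frameSize) {
        if (readBufferBytesAllocated.get() + frameSize > MAX_READ_BUFFER_BYTES) {
            return false; // request is postponed
        }
        readBufferBytesAllocated.addAndGet(frameSize);
        return true;
    }

    // Leaky variant: the release is skipped whenever the processor throws.
    static void invokeLeaky(int frameSize, Runnable processor) {
        try {
            processor.run();
            readBufferBytesAllocated.addAndGet(-frameSize);
        } catch (RuntimeException e) {
            // Only logged in this model; the reserved bytes are never given back.
        }
    }

    // One possible remedy (an assumption, not the actual patch): always release.
    static void invokeSafe(int frameSize, Runnable processor) {
        try {
            processor.run();
        } catch (RuntimeException e) {
            // Only logged in this model.
        } finally {
            readBufferBytesAllocated.addAndGet(-frameSize);
        }
    }

    // Sends `attempts` failing requests and counts how many were admitted.
    static int acceptedRequests(int attempts, int frameSize, boolean safe) {
        readBufferBytesAllocated.set(0);
        Runnable failing = () -> {
            throw new IllegalArgumentException("simulated processor failure");
        };
        int accepted = 0;
        for (int i = 0; i < attempts; i++) {
            if (!tryReserve(frameSize)) {
                continue; // stuck behind the exhausted counter
            }
            accepted++;
            if (safe) {
                invokeSafe(frameSize, failing);
            } else {
                invokeLeaky(frameSize, failing);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println("leaky: " + acceptedRequests(10, 256, false));
        System.out.println("safe:  " + acceptedRequests(10, 256, true));
    }
}
```

With 256-byte frames and the 1024-byte limit, the leaky variant admits only the first four failing requests, while the variant that releases in finally admits all ten.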
In the end, the server thread loops in select(), which wakes up immediately for
read() because the content of the socket is never drained.
The thread cycles forever between select() and the read() method above, causing
100% CPU usage on the server thread.
Moreover, all client requests are stuck forever.
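
The busy loop can be reproduced in isolation with plain java.nio, independent of Thrift. The sketch below is illustrative only: the class UndrainedSelectDemo and its method are hypothetical, and a Pipe stands in for the client socket. Java selectors report readiness for as long as unread data remains, so a channel that is never drained makes every select() call return at once.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

final class UndrainedSelectDemo {
    // Runs `rounds` select() calls against a channel holding one unread byte
    // and counts how many of them reported the channel as readable.
    static int countWakeups(int rounds, boolean drain) {
        try {
            Pipe pipe = Pipe.open();
            try (Selector selector = Selector.open();
                 Pipe.SinkChannel sink = pipe.sink();
                 Pipe.SourceChannel source = pipe.source()) {
                source.configureBlocking(false);
                source.register(selector, SelectionKey.OP_READ);
                sink.write(ByteBuffer.wrap(new byte[] {42})); // one pending byte

                int wakeups = 0;
                for (int i = 0; i < rounds; i++) {
                    int ready = selector.select(200); // wait at most 200 ms
                    selector.selectedKeys().clear();
                    if (ready > 0) {
                        wakeups++;
                        if (drain) {
                            // Reading the data clears the readiness condition.
                            source.read(ByteBuffer.allocate(16));
                        }
                        // Without the read, the next select() returns immediately.
                    }
                }
                return wakeups;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("undrained wakeups: " + countWakeups(5, false));
        System.out.println("drained wakeups:   " + countWakeups(5, true));
    }
}
```

When the byte is left unread, every round wakes up; when it is read on the first wakeup, the remaining rounds simply time out. The 200 ms timeout only keeps the demo from blocking and is not taken from the server code.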
Example of failed request:
{code}
2015-09-01T12:19:35.954+0200 b.s.d.nimbus [WARN] Topology submission exception. (topology name='mytopology') #<IllegalArgumentException java.lang.IllegalArgumentException: /opt/SPE/share/storm/storm/local/nimbus/inbox/stormjar-3f8f3ba7-5420-4773-af24-bfa294cceb79.jar to copy to /opt/SPE/share/storm/storm/local/nimbus/stormdist/mytopology-87-1441102775 does not exist!>
2015-09-01T12:19:35.955+0200 o.a.t.s.TNonblockingServer [ERROR] Unexpected exception while invoking!
java.lang.IllegalArgumentException: /opt/SPE/share/storm/storm/local/nimbus/inbox/stormjar-3f8f3ba7-5420-4773-af24-bfa294cceb79.jar to copy to /opt/SPE/share/storm/storm/local/nimbus/stormdist/mytopology-87-1441102775 does not exist!
        at backtype.storm.daemon.nimbus$fn__3827.invoke(nimbus.clj:1173) ~[storm-core-0.9.5.jar:0.9.5]
        at clojure.lang.MultiFn.invoke(MultiFn.java:236) ~[clojure-1.5.1.jar:na]
        at backtype.storm.daemon.nimbus$setup_storm_code.invoke(nimbus.clj:307) ~[storm-core-0.9.5.jar:0.9.5]
        at backtype.storm.daemon.nimbus$fn__3724$exec_fn__1103__auto__$reify__3737.submitTopologyWithOpts(nimbus.clj:953) ~[storm-core-0.9.5.jar:0.9.5]
        at backtype.storm.daemon.nimbus$fn__3724$exec_fn__1103__auto__$reify__3737.submitTopology(nimbus.clj:966) ~[storm-core-0.9.5.jar:0.9.5]
        at backtype.storm.generated.Nimbus$Processor$submitTopology.getResult(Nimbus.java:1240) ~[storm-core-0.9.5.jar:0.9.5]
        at backtype.storm.generated.Nimbus$Processor$submitTopology.getResult(Nimbus.java:1228) ~[storm-core-0.9.5.jar:0.9.5]
        at org.apache.thrift7.ProcessFunction.process(ProcessFunction.java:32) ~[storm-core-0.9.5.jar:0.9.5]
        at org.apache.thrift7.TBaseProcessor.process(TBaseProcessor.java:34) ~[storm-core-0.9.5.jar:0.9.5]
        at org.apache.thrift7.server.TNonblockingServer$FrameBuffer.invoke(TNonblockingServer.java:632) ~[storm-core-0.9.5.jar:0.9.5]
        at org.apache.thrift7.server.THsHaServer$Invocation.run(THsHaServer.java:201) [storm-core-0.9.5.jar:0.9.5]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_75]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_75]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
{code}
> Nimbus server hogs 100% CPU and clients are stuck
> --------------------------------------------------
>
> Key: STORM-1023
> URL: https://issues.apache.org/jira/browse/STORM-1023
> Project: Apache Storm
> Issue Type: Bug
> Affects Versions: 0.9.3, 0.10.0, 0.9.4, 0.11.0, 0.9.5, 0.9.6
> Environment: Storm 0.9.5 / thrift 0.7
> Reporter: Frantz Mazoyer