[
https://issues.apache.org/jira/browse/STORM-963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651841#comment-14651841
]
Alex Sobrino commented on STORM-963:
------------------------------------
Hi [~kabhwan],
We're not able to reproduce it on demand; it just happens every now and then.
It is, however, frequent enough that we can provide some test results.
Executing {{kill -SIGABRT <PID>}} on one of the Python processes writes this
into the worker's log:
{code}
2015-08-03T14:41:30.315+0200 b.s.t.ShellBolt [ERROR] Halting process: ShellBolt died.
java.lang.RuntimeException: subprocess heartbeat timeout
    at backtype.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:305) [storm-core-0.9.5.jar:0.9.5]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_45]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_45]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_45]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_45]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_45]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_45]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
2015-08-03T14:41:30.315+0200 b.s.d.executor [ERROR]
java.lang.RuntimeException: subprocess heartbeat timeout
    at backtype.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:305) [storm-core-0.9.5.jar:0.9.5]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_45]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_45]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_45]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_45]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_45]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_45]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
2015-08-03T14:41:30.317+0200 b.s.t.ShellBolt [ERROR] Halting process: ShellBolt died.
java.lang.RuntimeException: backtype.storm.multilang.NoOutputException: Pipe to subprocess seems to be broken! No output read.
Serializer Exception:
    at backtype.storm.utils.ShellProcess.readShellMsg(ShellProcess.java:101) ~[storm-core-0.9.5.jar:0.9.5]
    at backtype.storm.task.ShellBolt$BoltReaderRunnable.run(ShellBolt.java:318) ~[storm-core-0.9.5.jar:0.9.5]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
2015-08-03T14:41:30.318+0200 b.s.d.executor [ERROR]
java.lang.RuntimeException: backtype.storm.multilang.NoOutputException: Pipe to subprocess seems to be broken! No output read.
Serializer Exception:
    at backtype.storm.utils.ShellProcess.readShellMsg(ShellProcess.java:101) ~[storm-core-0.9.5.jar:0.9.5]
    at backtype.storm.task.ShellBolt$BoltReaderRunnable.run(ShellBolt.java:318) ~[storm-core-0.9.5.jar:0.9.5]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
2015-08-03T14:41:30.320+0200 b.s.t.ShellBolt [ERROR] Halting process: ShellBolt died.
java.io.IOException: Broken pipe
    at java.io.FileOutputStream.writeBytes(Native Method) ~[na:1.8.0_45]
    at java.io.FileOutputStream.write(FileOutputStream.java:326) ~[na:1.8.0_45]
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) ~[na:1.8.0_45]
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) ~[na:1.8.0_45]
    at java.io.DataOutputStream.flush(DataOutputStream.java:123) ~[na:1.8.0_45]
    at backtype.storm.multilang.JsonSerializer.writeString(JsonSerializer.java:96) ~[storm-core-0.9.5.jar:0.9.5]
    at backtype.storm.multilang.JsonSerializer.writeMessage(JsonSerializer.java:89) ~[storm-core-0.9.5.jar:0.9.5]
    at backtype.storm.multilang.JsonSerializer.writeBoltMsg(JsonSerializer.java:74) ~[storm-core-0.9.5.jar:0.9.5]
    at backtype.storm.utils.ShellProcess.writeBoltMsg(ShellProcess.java:106) ~[storm-core-0.9.5.jar:0.9.5]
    at backtype.storm.task.ShellBolt$BoltWriterRunnable.run(ShellBolt.java:355) ~[storm-core-0.9.5.jar:0.9.5]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
2015-08-03T14:41:30.320+0200 b.s.d.executor [ERROR]
java.io.IOException: Broken pipe
    at java.io.FileOutputStream.writeBytes(Native Method) ~[na:1.8.0_45]
    at java.io.FileOutputStream.write(FileOutputStream.java:326) ~[na:1.8.0_45]
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) ~[na:1.8.0_45]
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) ~[na:1.8.0_45]
    at java.io.DataOutputStream.flush(DataOutputStream.java:123) ~[na:1.8.0_45]
    at backtype.storm.multilang.JsonSerializer.writeString(JsonSerializer.java:96) ~[storm-core-0.9.5.jar:0.9.5]
    at backtype.storm.multilang.JsonSerializer.writeMessage(JsonSerializer.java:89) ~[storm-core-0.9.5.jar:0.9.5]
    at backtype.storm.multilang.JsonSerializer.writeBoltMsg(JsonSerializer.java:74) ~[storm-core-0.9.5.jar:0.9.5]
    at backtype.storm.utils.ShellProcess.writeBoltMsg(ShellProcess.java:106) ~[storm-core-0.9.5.jar:0.9.5]
    at backtype.storm.task.ShellBolt$BoltWriterRunnable.run(ShellBolt.java:355) ~[storm-core-0.9.5.jar:0.9.5]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
2015-08-03T14:41:30.357+0200 b.s.d.worker [INFO] Shutting down worker
my-topology-47-1438604011 63b8c93e-7d6e-4d67-b4f8-957c9650e7ba 6700
2015-08-03T14:41:30.358+0200 b.s.m.n.Client [INFO] closing Netty Client
Netty-Client-vm05.v5tech.local/172.28.69.111:6700
2015-08-03T14:41:30.358+0200 b.s.m.n.Client [INFO] waiting up to 600000 ms to
send 0 pending messages to Netty-Client-vm05.v5tech.local/172.28.69.111:6700
2015-08-03T14:41:30.359+0200 b.s.m.n.Client [INFO] closing Netty Client
Netty-Client-vm05.v5tech.local/172.28.69.111:6701
2015-08-03T14:41:30.360+0200 b.s.m.n.Client [INFO] waiting up to 600000 ms to
send 0 pending messages to Netty-Client-vm05.v5tech.local/172.28.69.111:6701
2015-08-03T14:41:30.360+0200 b.s.d.worker [INFO] Shutting down receive thread
2015-08-03T14:41:30.360+0200 b.s.m.n.Client [INFO] creating Netty Client,
connecting to localhost:6700, bufferSize: 5242880
2015-08-03T14:41:30.360+0200 o.a.s.c.r.ExponentialBackoffRetry [WARN]
maxRetries too large (300). Pinning to 29
2015-08-03T14:41:30.360+0200 b.s.u.StormBoundedExponentialBackoffRetry [INFO]
The baseSleepTimeMs [100] the maxSleepTimeMs [1000] the maxRetries [300]
2015-08-03T14:41:30.360+0200 b.s.m.n.Client [INFO] connection attempt 1 to
Netty-Client-localhost/127.0.0.1:6700 scheduled to run in 0 ms
2015-08-03T14:41:30.361+0200 b.s.m.loader [INFO] Shutting down
receiving-thread: [my-topology-47-1438604011, 6700]
2015-08-03T14:41:30.361+0200 b.s.m.n.Client [ERROR] connection to
Netty-Client-localhost/127.0.0.1:6700 is unavailable
2015-08-03T14:41:30.361+0200 b.s.m.n.Client [ERROR] dropping 1 message(s)
destined for Netty-Client-localhost/127.0.0.1:6700
2015-08-03T14:41:30.361+0200 b.s.m.n.Client [INFO] closing Netty Client
Netty-Client-localhost/127.0.0.1:6700
2015-08-03T14:41:30.361+0200 b.s.m.n.Client [INFO] waiting up to 600000 ms to
send 0 pending messages to Netty-Client-localhost/127.0.0.1:6700
2015-08-03T14:41:30.361+0200 b.s.m.loader [INFO] Waiting for
receiving-thread:[my-topology-47-1438604011, 6700] to die
2015-08-03T14:41:30.365+0200 b.s.m.n.Client [INFO] connection established to
Netty-Client-localhost/127.0.0.1:6700
2015-08-03T14:41:30.366+0200 b.s.m.loader [INFO] Shutdown receiving-thread:
[my-topology-47-1438604011, 6700]
2015-08-03T14:41:30.366+0200 b.s.d.worker [INFO] Shut down receive thread
2015-08-03T14:41:30.366+0200 b.s.d.worker [INFO] Terminating messaging context
2015-08-03T14:41:30.366+0200 b.s.d.worker [INFO] Shutting down executors
2015-08-03T14:41:30.368+0200 b.s.d.executor [INFO] Shutting down executor
processBolt:[2 2]
2015-08-03T14:41:30.368+0200 b.s.util [INFO] Async loop interrupted!
2015-08-03T14:41:30.368+0200 b.s.util [INFO] Async loop interrupted!
2015-08-03T14:41:30.370+0200 b.s.d.executor [INFO] Shut down executor
processBolt:[2 2]
2015-08-03T14:41:30.370+0200 b.s.d.executor [INFO] Shutting down executor
processBolt:[5 5]
2015-08-03T14:41:30.371+0200 b.s.util [INFO] Async loop interrupted!
{code}
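For what it's worth, our reading of why SIGABRT produces exactly this pair of errors (heartbeat timeout first, broken pipe next): ShellBolt periodically sends a heartbeat tuple to the subprocess and expects a "sync" reply; a dead Python process never replies, so {{BoltHeartbeatTimerTask}} fires, and the reader/writer threads then hit the closed pipe. A minimal Python sketch of that heartbeat handling follows; the message fields are our assumptions from reading storm.py/ShellBolt, not the actual implementation:

```python
import json

def respond(raw_message):
    """Decide how a multilang subprocess should react to one protocol message.

    Assumption: heartbeat tuples arrive on the "__heartbeat__" stream with
    task -1 and must be answered with a "sync" command. If the process is dead
    (e.g. after SIGABRT), no "sync" ever comes back and ShellBolt's heartbeat
    timer throws the "subprocess heartbeat timeout" seen in the log above.
    """
    msg = json.loads(raw_message)
    if msg.get("stream") == "__heartbeat__" and msg.get("task") == -1:
        return json.dumps({"command": "sync"})
    return None  # ordinary tuple: the bolt would process/emit/ack here

heartbeat = '{"id": "hb", "task": -1, "stream": "__heartbeat__", "tuple": []}'
print(respond(heartbeat))  # -> {"command": "sync"}
```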
In the supervisor's log:
{code}
2015-08-03T14:42:01.542+0200 b.s.d.supervisor [INFO] Shutting down and clearing
state for id f2111c3b-31e5-4e72-b16b-ff5897148844. Current supervisor time:
1438605721. State: :timed-out, Heartbeat:
#backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1438605690, :storm-id
"my-topology-47-1438604011", :executors #{[2 2] [5 5] [8 8] [11 11] [14 14] [-1
-1]}, :port 6700}
2015-08-03T14:42:01.542+0200 b.s.d.supervisor [INFO] Shutting down
63b8c93e-7d6e-4d67-b4f8-957c9650e7ba:f2111c3b-31e5-4e72-b16b-ff5897148844
2015-08-03T14:42:01.544+0200 b.s.util [INFO] Error when trying to kill 63473.
Process is probably already dead.
2015-08-03T14:42:01.546+0200 b.s.util [INFO] Error when trying to kill 63466.
Process is probably already dead.
2015-08-03T14:42:01.548+0200 b.s.util [INFO] Error when trying to kill 63481.
Process is probably already dead.
2015-08-03T14:42:01.549+0200 b.s.util [INFO] Error when trying to kill 63465.
Process is probably already dead.
2015-08-03T14:42:01.554+0200 b.s.util [INFO] Error when trying to kill 63414.
Process is probably already dead.
2015-08-03T14:42:02.556+0200 b.s.util [INFO] Error when trying to kill 63473.
Process is probably already dead.
2015-08-03T14:42:02.558+0200 b.s.util [INFO] Error when trying to kill 63466.
Process is probably already dead.
2015-08-03T14:42:02.562+0200 b.s.util [INFO] Error when trying to kill 63481.
Process is probably already dead.
2015-08-03T14:42:02.563+0200 b.s.util [INFO] Error when trying to kill 63465.
Process is probably already dead.
2015-08-03T14:42:02.566+0200 b.s.util [INFO] Error when trying to kill 63414.
Process is probably already dead.
2015-08-03T14:42:02.569+0200 b.s.d.supervisor [INFO] Shut down
63b8c93e-7d6e-4d67-b4f8-957c9650e7ba:f2111c3b-31e5-4e72-b16b-ff5897148844
2015-08-03T14:42:02.570+0200 b.s.d.supervisor [INFO] Launching worker with
assignment #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id
"my-topology-47-1438604011", :executors ([2 2] [5 5] [8 8] [11 11] [14 14])}
for this supervisor 63b8c93e-7d6e-4d67-b4f8-957c9650e7ba on port 6700 with id
e75a75ff-2eed-430d-bd77-8cd707667b1d
2015-08-03T14:42:02.571+0200 b.s.d.supervisor [INFO] Launching worker with
command: '/opt/java/latest/bin/java' '-server' '-Xmx6144m'
'-Djava.library.path=/var/lib/storm/supervisor/stormdist/my-topology-47-1438604011/resources/Linux-amd64:/var/lib/storm/supervisor/stormdist/my-topology-47-1438604011/resources:/usr/local/lib:/opt/local/lib:/usr/lib'
'-Dlogfile.name=worker-6700.log' '-Dstorm.home=/opt/mydir/apache-storm-0.9.5'
'-Dstorm.conf.file=' '-Dstorm.options='
'-Dstorm.log.dir=/opt/mydir/apache-storm-0.9.5/logs'
'-Dlogback.configurationFile=/opt/mydir/apache-storm-0.9.5/logback/cluster.xml'
'-Dstorm.id=my-topology-47-1438604011'
'-Dworker.id=e75a75ff-2eed-430d-bd77-8cd707667b1d' '-Dworker.port=6700' '-cp'
'/opt/mydir/apache-storm-0.9.5/lib/minlog-1.2.jar:/opt/mydir/apache-storm-0.9.5/lib/carbonite-1.4.0.jar:/opt/mydir/apache-storm-0.9.5/lib/json-simple-1.1.jar:/opt/mydir/apache-storm-0.9.5/lib/slf4j-api-1.7.5.jar:/opt/mydir/apache-storm-0.9.5/lib/kryo-2.21.jar:/opt/mydir/apache-storm-0.9.5/lib/tools.logging-0.2.3.jar:/opt/mydir/apache-storm-0.9.5/lib/ring-servlet-0.3.11.jar:/opt/mydir/apache-storm-0.9.5/lib/tools.cli-0.2.4.jar:/opt/mydir/apache-storm-0.9.5/lib/disruptor-2.10.1.jar:/opt/mydir/apache-storm-0.9.5/lib/clj-stacktrace-0.2.2.jar:/opt/mydir/apache-storm-0.9.5/lib/math.numeric-tower-0.0.1.jar:/opt/ymdir/apache-storm-0.9.5/lib/ring-jetty-adapter-0.3.11.jar:/opt/mydir/apache-storm-0.9.5/lib/commons-io-2.4.jar:/opt/mydir/apache-storm-0.9.5/lib/servlet-api-2.5.jar:/opt/mydir/apache-storm-0.9.5/lib/log4j-over-slf4j-1.6.6.jar:/opt/mydir/apache-storm-0.9.5/lib/core.incubator-0.1.0.jar:/opt/mydir/apache-storm-0.9.5/lib/asm-4.0.jar:/opt/mydir/apache-storm-0.9.5/lib/hiccup-0.3.6.jar:/opt/mydir/apache-storm-0.9.5/lib/jetty-util-6.1.26.jar:/opt/mydir/apache-storm-0.9.5/lib/tools.macro-0.1.0.jar:/opt/mydir/apache-storm-0.9.5/lib/ring-devel-0.3.11.jar:/opt/mydir/apache-storm-0.9.5/lib/commons-exec-1.1.jar:/opt/mydir/apache-storm-0.9.5/lib/ring-core-1.1.5.jar:/opt/mydir/apache-storm-0.9.5/lib/clout-1.0.1.jar:/opt/mydir/apache-storm-0.9.5/lib/jetty-6.1.26.jar:/opt/mydir/apache-storm-0.9.5/lib/objenesis-1.2.jar:/opt/mydir/apache-storm-0.9.5/lib/logback-core-1.0.13.jar:/opt/mydir/apache-storm-0.9.5/lib/jgrapht-core-0.9.0.jar:/opt/mydir/apache-storm-0.9.5/lib/commons-codec-1.6.jar:/opt/mydir/apache-storm-0.9.5/lib/commons-lang-2.5.jar:/opt/mydir/apache-storm-0.9.5/lib/clojure-1.5.1.jar:/opt/mydir/apache-storm-0.9.5/lib/storm-core-0.9.5.jar:/opt/mydir/apache-storm-0.9.5/lib/chill-java-0.3.5.jar:/opt/mydir/apache-storm-0.9.5/lib/reflectasm-1.07-shaded.jar:/opt/mydir/apache-storm-0.9.5/lib/joda-time-2.0.jar:/opt/mydir/apache-storm-0.9.5/lib/commons-logging-1.1.3.jar:/opt/mydir/apa
che-storm-0.9.5/lib/compojure-1.1.3.jar:/opt/mydir/apache-storm-0.9.5/lib/clj-time-0.4.1.jar:/opt/mydir/apache-storm-0.9.5/lib/jline-2.11.jar:/opt/mydir/apache-storm-0.9.5/lib/commons-fileupload-1.2.1.jar:/opt/mydir/apache-storm-0.9.5/lib/logback-classic-1.0.13.jar:/opt/mydir/apache-storm-0.9.5/lib/snakeyaml-1.11.jar:/opt/mydir/apache-storm-0.9.5/conf:/var/lib/storm/supervisor/stormdist/my-topology-47-1438604011/stormjar.jar'
'backtype.storm.daemon.worker' 'my-topology-47-1438604011'
'63b8c93e-7d6e-4d67-b4f8-957c9650e7ba' '6700'
'e75a75ff-2eed-430d-bd77-8cd707667b1d'
2015-08-03T14:42:02.571+0200 b.s.d.supervisor [INFO]
e75a75ff-2eed-430d-bd77-8cd707667b1d still hasn't started
2015-08-03T14:42:03.072+0200 b.s.d.supervisor [INFO]
e75a75ff-2eed-430d-bd77-8cd707667b1d still hasn't started
2015-08-03T14:42:03.572+0200 b.s.d.supervisor [INFO]
e75a75ff-2eed-430d-bd77-8cd707667b1d still hasn't started
{code}
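A side note on the supervisor's {{:timed-out}} verdict: assuming the default {{supervisor.worker.timeout.secs}} of 30 (we haven't overridden it), the heartbeat in the log above is exactly one second past the limit:

```python
# Sanity-check the ":timed-out" state using the two timestamps from the
# supervisor log above. 30 is the default supervisor.worker.timeout.secs
# (assumption: we run with the default).
supervisor_time = 1438605721  # "Current supervisor time"
last_heartbeat = 1438605690   # WorkerHeartbeat :time-secs
worker_timeout = 30

age_secs = supervisor_time - last_heartbeat
print(age_secs)                   # 31
print(age_secs > worker_timeout)  # True -> worker is declared :timed-out
```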
And nimbus' log:
{code}
2015-08-03T14:42:03.959+0200 b.s.d.nimbus [INFO] Executor
my-topology-47-1438604011:[2 2] not alive
2015-08-03T14:42:03.959+0200 b.s.d.nimbus [INFO] Executor
my-topology-47-1438604011:[5 5] not alive
2015-08-03T14:42:03.959+0200 b.s.d.nimbus [INFO] Executor
my-topology-47-1438604011:[8 8] not alive
2015-08-03T14:42:03.959+0200 b.s.d.nimbus [INFO] Executor
my-topology-47-1438604011:[11 11] not alive
2015-08-03T14:42:03.959+0200 b.s.d.nimbus [INFO] Executor
my-topology-47-1438604011:[14 14] not alive
2015-08-03T14:42:03.964+0200 b.s.s.EvenScheduler [INFO] Available slots:
(["3e795735-aa1d-483d-8140-1faa9eb9161c" 6702]
["3e795735-aa1d-483d-8140-1faa9eb9161c" 6703]
["63b8c93e-7d6e-4d67-b4f8-957c9650e7ba" 6701]
["63b8c93e-7d6e-4d67-b4f8-957c9650e7ba" 6702]
["63b8c93e-7d6e-4d67-b4f8-957c9650e7ba" 6703]
["1dbdec2d-9e91-42e5-beac-2f2848667e14" 6700]
["1dbdec2d-9e91-42e5-beac-2f2848667e14" 6701]
["1dbdec2d-9e91-42e5-beac-2f2848667e14" 6702]
["1dbdec2d-9e91-42e5-beac-2f2848667e14" 6703])
2015-08-03T14:42:03.964+0200 b.s.d.nimbus [INFO] Reassigning
my-topology-47-1438604011 to 3 slots
2015-08-03T14:42:03.964+0200 b.s.d.nimbus [INFO] Reassign executors: [[2 2] [5
5] [8 8] [11 11] [14 14]]
2015-08-03T14:42:03.969+0200 b.s.d.nimbus [INFO] Setting new assignment for
topology id my-topology-47-1438604011:
#backtype.storm.daemon.common.Assignment{:master-code-dir
"/var/lib/storm/nimbus/stormdist/my-topology-47-1438604011", :node->host
{"1dbdec2d-9e91-42e5-beac-2f2848667e14" "vm06.v5tech.local",
"3e795735-aa1d-483d-8140-1faa9eb9161c" "vm05.v5tech.local"},
:executor->node+port {[2 2] ["1dbdec2d-9e91-42e5-beac-2f2848667e14" 6700], [3
3] ["3e795735-aa1d-483d-8140-1faa9eb9161c" 6700], [4 4]
["3e795735-aa1d-483d-8140-1faa9eb9161c" 6701], [5 5]
["1dbdec2d-9e91-42e5-beac-2f2848667e14" 6700], [6 6]
["3e795735-aa1d-483d-8140-1faa9eb9161c" 6700], [7 7]
["3e795735-aa1d-483d-8140-1faa9eb9161c" 6701], [8 8]
["1dbdec2d-9e91-42e5-beac-2f2848667e14" 6700], [9 9]
["3e795735-aa1d-483d-8140-1faa9eb9161c" 6700], [10 10]
["3e795735-aa1d-483d-8140-1faa9eb9161c" 6701], [11 11]
["1dbdec2d-9e91-42e5-beac-2f2848667e14" 6700], [12 12]
["3e795735-aa1d-483d-8140-1faa9eb9161c" 6700], [13 13]
["3e795735-aa1d-483d-8140-1faa9eb9161c" 6701], [14 14]
["1dbdec2d-9e91-42e5-beac-2f2848667e14" 6700], [15 15]
["3e795735-aa1d-483d-8140-1faa9eb9161c" 6700], [1 1]
["3e795735-aa1d-483d-8140-1faa9eb9161c" 6701]}, :executor->start-time-secs {[2
2] 1438605723, [3 3] 1438605182, [4 4] 1438605573, [5 5] 1438605723, [6 6]
1438605182, [7 7] 1438605573, [8 8] 1438605723, [9 9] 1438605182, [10 10]
1438605573, [11 11] 1438605723, [12 12] 1438605182, [13 13] 1438605573, [14 14]
1438605723, [15 15] 1438605182, [1 1] 1438605573}}
{code}
Basically, Storm rebalances the topology, removing the node where we executed
the {{kill}} command. That produces a lot of error messages and some
Netty-related noise. After that, everything seems to work again. In fact, we
can rebalance the topology again (via the UI) and everything works fine for a
random amount of time (minutes or hours).
Right now, after 15 minutes, it's frozen again and we can't find a single
suspicious line in the logs; these are the last entries:
{code}
2015-08-03T14:51:22.028+0200 b.s.t.ShellBolt [INFO] ShellLog pid:64304,
name:processBolt I'm doing something
2015-08-03T14:51:50.981+0200 b.s.t.ShellBolt [INFO] ShellLog pid:64304,
name:processBolt I'm doing something
2015-08-03T14:51:51.211+0200 b.s.t.ShellBolt [INFO] ShellLog pid:64304,
name:processBolt I'm doing something
2015-08-03T14:51:52.289+0200 b.s.t.ShellBolt [INFO] ShellLog pid:64304,
name:processBolt I'm doing something
2015-08-03T14:51:58.181+0200 s.k.ZkCoordinator [INFO] Task [3/3] Refreshing
partition manager connections
2015-08-03T14:51:58.183+0200 s.k.DynamicBrokersReader [INFO] Read partition
info from zookeeper: GlobalPartitionInformation{partitionMap={0=kafka1:9092,
1=kafka2:9092, 2=kafka3:9092}}
2015-08-03T14:51:58.183+0200 s.k.KafkaUtils [INFO] Task [3/3] assigned
[Partition{host=kafka3:9092, partition=2}]
2015-08-03T14:51:58.183+0200 s.k.ZkCoordinator [INFO] Task [3/3] Deleted
partition managers: []
2015-08-03T14:51:58.183+0200 s.k.ZkCoordinator [INFO] Task [3/3] New partition
managers: []
2015-08-03T14:51:58.183+0200 s.k.ZkCoordinator [INFO] Task [3/3] Finished
refreshing
2015-08-03T14:52:22.251+0200 b.s.t.ShellBolt [INFO] ShellLog pid:64304,
name:processBolt I'm doing something
2015-08-03T14:52:22.278+0200 b.s.t.ShellBolt [INFO] ShellLog pid:64304,
name:processBolt I'm doing something
2015-08-03T14:52:22.289+0200 b.s.t.ShellBolt [INFO] ShellLog pid:64304,
name:processBolt I'm doing something
2015-08-03T14:52:58.184+0200 s.k.ZkCoordinator [INFO] Task [3/3] Refreshing
partition manager connections
2015-08-03T14:52:58.187+0200 s.k.DynamicBrokersReader [INFO] Read partition
info from zookeeper: GlobalPartitionInformation{partitionMap={0=kafka1:9092,
1=kafka2:9092, 2=kafka3:9092}}
2015-08-03T14:52:58.187+0200 s.k.KafkaUtils [INFO] Task [3/3] assigned
[Partition{host=kafka12.v5tech.local:9092, partition=2}]
2015-08-03T14:52:58.187+0200 s.k.ZkCoordinator [INFO] Task [3/3] Deleted
partition managers: []
2015-08-03T14:52:58.187+0200 s.k.ZkCoordinator [INFO] Task [3/3] New partition
managers: []
2015-08-03T14:52:58.187+0200 s.k.ZkCoordinator [INFO] Task [3/3] Finished
refreshing
2015-08-03T14:53:58.189+0200 s.k.ZkCoordinator [INFO] Task [3/3] Refreshing
partition manager connections
2015-08-03T14:53:58.192+0200 s.k.DynamicBrokersReader [INFO] Read partition
info from zookeeper: GlobalPartitionInformation{partitionMap={0=kafka1:9092,
1=kafka2:9092, 2=kafka3:9092}}
2015-08-03T14:53:58.193+0200 s.k.KafkaUtils [INFO] Task [3/3] assigned
[Partition{host=kafka3:9092, partition=2}]
2015-08-03T14:53:58.193+0200 s.k.ZkCoordinator [INFO] Task [3/3] Deleted
partition managers: []
2015-08-03T14:53:58.193+0200 s.k.ZkCoordinator [INFO] Task [3/3] New partition
managers: []
2015-08-03T14:53:58.193+0200 s.k.ZkCoordinator [INFO] Task [3/3] Finished
refreshing
2015-08-03T14:54:58.194+0200 s.k.ZkCoordinator [INFO] Task [3/3] Refreshing
partition manager connections
2015-08-03T14:54:58.198+0200 s.k.DynamicBrokersReader [INFO] Read partition
info from zookeeper: GlobalPartitionInformation{partitionMap={0=kafka1:9092,
1=kafka2:9092, 2=kafka3:9092}}
2015-08-03T14:54:58.199+0200 s.k.KafkaUtils [INFO] Task [3/3] assigned
[Partition{host=kafka3:9092, partition=2}]
2015-08-03T14:54:58.199+0200 s.k.ZkCoordinator [INFO] Task [3/3] Deleted
partition managers: []
2015-08-03T14:54:58.199+0200 s.k.ZkCoordinator [INFO] Task [3/3] New partition
managers: []
2015-08-03T14:54:58.199+0200 s.k.ZkCoordinator [INFO] Task [3/3] Finished
refreshing
{code}
I'll add the {{jstack}} results later on.
I can provide you with full logs from a fresh topology deploy if a secure
upload URL is provided.
Hope this information helps a little bit.
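For completeness, here is the test procedure from the top of this comment as a script (the {{pgrep}} pattern is an assumption; PIDs can also be read from the "ShellLog pid:NNNN" lines in the worker log):

```python
import os
import signal
import subprocess

def abort_subprocess(pid):
    """Send SIGABRT to one multilang subprocess, as in our test above."""
    os.kill(pid, signal.SIGABRT)

def find_bolt_pids(pattern="python.*processBolt"):
    """Locate candidate bolt PIDs via pgrep -f (pattern is hypothetical)."""
    result = subprocess.run(["pgrep", "-f", pattern],
                            capture_output=True, text=True)
    return [int(p) for p in result.stdout.split()]
```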
> Frozen topology (KafkaSpout + Multilang bolt)
> ---------------------------------------------
>
> Key: STORM-963
> URL: https://issues.apache.org/jira/browse/STORM-963
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-kafka
> Affects Versions: 0.9.4, 0.9.5, 0.9.6
> Environment: - VMware ESX 5.5
> - Ubuntu Server 14.04 LTS (kernel 3.16.0-41-generic)
> - Java (TM) SE Runtime Environment (build 1.8.0_45-b14)
> - Python 2.7.6 (default, Jun 22 2015, 17:58:13)
> - Zookeeper 3.4.6
> Reporter: Alex Sobrino
> Labels: multilang
>
> Hi,
> We've got a pretty simple topology running with Storm 0.9.5 (also tried with
> 0.9.4 and 0.9.6-INCUBATING) in a 3-machine cluster:
> {code}kafkaSpout (3) -----> processBolt (12){code}
> Some info:
> - kafkaSpout reads from a topic with 3 partitions and 2 replications
> - processBolt iterates through the message and saves the results in MongoDB
> - processBolt is implemented in Python and has a storm.log("I'm doing
> something") just to add a simple debug message in the logs
> - The messages can be quite big (~25-40 MB) and are in JSON format
> - The kafka topic has a retention of 2 hours
> - We use the same ZooKeeper cluster for both Kafka and Storm
> The topology gets frozen after several hours (not days) of running. We don't
> see any message in the logs... In fact, the periodic messages from
> s.k.KafkaUtils and s.k.ZkCoordinator disappear. As you can imagine, the
> message from the bolt also disappears. Logs are copy/pasted further on. If we
> redeploy the topology, everything starts working again until it freezes again.
> Our kafkaSpout config is:
> {code}
> ZkHosts zkHosts = new ZkHosts("zkhost01:2181,zkhost02:2181,zkhost03:2181");
> SpoutConfig kafkaConfig = new SpoutConfig(zkHosts, "topic",
> "/topic/ourclientid", "ourclientid");
> kafkaConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
> kafkaConfig.fetchSizeBytes = 50*1024*1024;
> kafkaConfig.bufferSizeBytes = 50*1024*1024;
> {code}
> We've also tried setting the following options:
> {code}
> kafkaConfig.forceFromStart = true;
> kafkaConfig.startOffsetTime = kafka.api.OffsetRequest.EarliestTime(); // Also
> with kafka.api.OffsetRequest.LatestTime();
> kafkaConfig.useStartOffsetTimeIfOffsetOutOfRange = true;
> {code}
> Right now the topology is running without acking the messages since there's a
> bug in kafkaSpout with failed messages and deleted offsets in Kafka.
> This is what can be seen in the logs in one of the workers:
> {code}
> 2015-07-23T12:37:38.008+0200 b.s.t.ShellBolt [INFO] ShellLog pid:28364,
> name:processBolt I'm doing something
> 2015-07-23T12:37:39.079+0200 b.s.t.ShellBolt [INFO] ShellLog pid:28364,
> name:processBolt I'm doing something
> 2015-07-23T12:37:51.013+0200 b.s.t.ShellBolt [INFO] ShellLog pid:28364,
> name:processBolt I'm doing something
> 2015-07-23T12:37:51.091+0200 b.s.t.ShellBolt [INFO] ShellLog pid:28364,
> name:processBolt I'm doing something
> 2015-07-23T12:38:02.684+0200 s.k.ZkCoordinator [INFO] Task [2/3] Refreshing
> partition manager connections
> 2015-07-23T12:38:02.687+0200 s.k.DynamicBrokersReader [INFO] Read partition
> info from zookeeper: GlobalPartitionInformation{partitionMap={0=kafka1:9092,
> 1=kafka2:9092, 2=kafka3:9092}}
> 2015-07-23T12:38:02.687+0200 s.k.KafkaUtils [INFO] Task [2/3] assigned
> [Partition{host=kafka2, partition=1}]
> 2015-07-23T12:38:02.687+0200 s.k.ZkCoordinator [INFO] Task [2/3] Deleted
> partition managers: []
> 2015-07-23T12:38:02.687+0200 s.k.ZkCoordinator [INFO] Task [2/3] New
> partition managers: []
> 2015-07-23T12:38:02.687+0200 s.k.ZkCoordinator [INFO] Task [2/3] Finished
> refreshing
> 2015-07-23T12:38:09.012+0200 b.s.t.ShellBolt [INFO] ShellLog pid:28364,
> name:processBolt I'm doing something
> 2015-07-23T12:38:41.878+0200 b.s.t.ShellBolt [INFO] ShellLog pid:28364,
> name:processBolt I'm doing something
> 2015-07-23T12:39:02.688+0200 s.k.ZkCoordinator [INFO] Task [2/3] Refreshing
> partition manager connections
> 2015-07-23T12:39:02.691+0200 s.k.DynamicBrokersReader [INFO] Read partition
> info from zookeeper: GlobalPartitionInformation{partitionMap={0=kafka1:9092,
> 1=kafka2:9092, 2=kafka3:9092}}
> 2015-07-23T12:39:02.691+0200 s.k.KafkaUtils [INFO] Task [2/3] assigned
> [Partition{host=kafka2:9092, partition=1}]
> 2015-07-23T12:39:02.691+0200 s.k.ZkCoordinator [INFO] Task [2/3] Deleted
> partition managers: []
> 2015-07-23T12:39:02.691+0200 s.k.ZkCoordinator [INFO] Task [2/3] New
> partition managers: []
> 2015-07-23T12:39:02.691+0200 s.k.ZkCoordinator [INFO] Task [2/3] Finished
> refreshing
> 2015-07-23T12:40:02.692+0200 s.k.ZkCoordinator [INFO] Task [2/3] Refreshing
> partition manager connections
> 2015-07-23T12:40:02.695+0200 s.k.DynamicBrokersReader [INFO] Read partition
> info from zookeeper: GlobalPartitionInformation{partitionMap={0=kafka1:9092,
> 1=kafka2:9092, 2=kafka3:9092}}
> 2015-07-23T12:40:02.695+0200 s.k.KafkaUtils [INFO] Task [2/3] assigned
> [Partition{host=kafka2:9092, partition=1}]
> 2015-07-23T12:40:02.695+0200 s.k.ZkCoordinator [INFO] Task [2/3] Deleted
> partition managers: []
> 2015-07-23T12:40:02.695+0200 s.k.ZkCoordinator [INFO] Task [2/3] New
> partition managers: []
> 2015-07-23T12:40:02.695+0200 s.k.ZkCoordinator [INFO] Task [2/3] Finished
> refreshing
> 2015-07-23T12:41:02.696+0200 s.k.ZkCoordinator [INFO] Task [2/3] Refreshing
> partition manager connections
> 2015-07-23T12:41:02.699+0200 s.k.DynamicBrokersReader [INFO] Read partition
> info from zookeeper: GlobalPartitionInformation{partitionMap={0=kafka1:9092,
> 1=kafka2:9092, 2=kafka3:9092}}
> 2015-07-23T12:41:02.699+0200 s.k.KafkaUtils [INFO] Task [2/3] assigned
> [Partition{host=kafka2:9092, partition=1}]
> 2015-07-23T12:41:02.699+0200 s.k.ZkCoordinator [INFO] Task [2/3] Deleted
> partition managers: []
> 2015-07-23T12:41:02.699+0200 s.k.ZkCoordinator [INFO] Task [2/3] New
> partition managers: []
> 2015-07-23T12:41:02.699+0200 s.k.ZkCoordinator [INFO] Task [2/3] Finished
> refreshing
> 2015-07-23T12:42:02.735+0200 s.k.ZkCoordinator [INFO] Task [2/3] Refreshing
> partition manager connections
> 2015-07-23T12:42:02.737+0200 s.k.DynamicBrokersReader [INFO] Read partition
> info from zookeeper: GlobalPartitionInformation{partitionMap={0=kafka1:9092,
> 1=kafka2:9092, 2=kafka3:9092}}
> 2015-07-23T12:42:02.737+0200 s.k.KafkaUtils [INFO] Task [2/3] assigned
> [Partition{host=kafka2:9092, partition=1}]
> 2015-07-23T12:42:02.737+0200 s.k.ZkCoordinator [INFO] Task [2/3] Deleted
> partition managers: []
> 2015-07-23T12:42:02.737+0200 s.k.ZkCoordinator [INFO] Task [2/3] New
> partition managers: []
> 2015-07-23T12:42:02.737+0200 s.k.ZkCoordinator [INFO] Task [2/3] Finished
> refreshing
> {code}
> and then it becomes frozen. Nothing is written into the nimbus log. We've
> checked the offsets in ZooKeeper and they're not updated:
> {code}
> {"topology":{"id":"218e58a5-6bfb-4b32-ae89-f3afa19306e1","name":"our-topology"},"offset":12047144,"partition":1,"broker":{"host":"kafka2","port":9092},"topic":"topic"}
> cZxid = 0x100028958
> ctime = Wed Jul 01 12:22:36 CEST 2015
> mZxid = 0x100518527
> mtime = Thu Jul 23 12:42:41 CEST 2015
> pZxid = 0x100028958
> cversion = 0
> dataVersion = 446913
> aclVersion = 0
> ephemeralOwner = 0x0
> dataLength = 183
> numChildren = 0
> {code}
> Any ideas of what we could be missing?
> PS: This was sent to the Storm user's mailing list and got 0 replies :\
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)