[
https://issues.apache.org/jira/browse/STORM-130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059503#comment-15059503
]
Gergely Nagy commented on STORM-130:
------------------------------------
Hi folks. I'm experiencing the same bug in 0.10.0. In my case the supervisor
survives, but the topology dies and gets rebalanced to another supervisor.
This is what I see in the supervisor log:
{code}
2015-12-16 12:56:27.746 b.s.u.Utils [INFO] Using defaults.yaml from resources
2015-12-16 12:56:27.784 b.s.u.Utils [INFO] Using storm.yaml from resources
2015-12-16 12:56:27.841 b.s.d.supervisor [INFO] Launching worker with command:
'java' '-cp'
'/home/fogetti/downloads/apache-storm-0.10.0/lib/kryo-2.21.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/log4j-over-slf4j-1.6.6.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/log4j-slf4j-impl-2.1.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/minlog-1.2.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/asm-4.0.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/log4j-core-2.1.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/reflectasm-1.07-shaded.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/log4j-api-2.1.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/hadoop-auth-2.4.0.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/slf4j-api-1.7.7.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/clojure-1.6.0.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/servlet-api-2.5.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/disruptor-2.10.4.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/storm-core-0.10.0.jar:/home/fogetti/downloads/apache-storm-0.10.0/conf:/home/fogetti/downloads/apache-storm-0.10.0/storm-local/supervisor/stormdist/phish-storm-topology-2-1450237054/stormjar.jar'
'-Dlogfile.name=phish-storm-topology-2-1450237054-worker-6700.log'
'-Dstorm.home=/home/fogetti/downloads/apache-storm-0.10.0'
'-Dstorm.id=phish-storm-topology-2-1450237054'
'-Dworker.id=d3819964-7671-447d-8763-827ab5cd6140' '-Dworker.port=6700'
'-Dstorm.log.dir=/home/fogetti/downloads/apache-storm-0.10.0/logs'
'-Dlog4j.configurationFile=/home/fogetti/downloads/apache-storm-0.10.0/log4j2/worker.xml'
'backtype.storm.LogWriter' 'java' '-server' '-Xmx768m'
'-Djava.library.path=/home/fogetti/downloads/apache-storm-0.10.0/storm-local/supervisor/stormdist/phish-storm-topology-2-1450237054/resources/Linux-amd64:/home/fogetti/downloads/apache-storm-0.10.0/storm-local/supervisor/stormdist/phish-storm-topology-2-1450237054/resources:/usr'
'-Dlogfile.name=phish-storm-topology-2-1450237054-worker-6700.log'
'-Dstorm.home=/home/fogetti/downloads/apache-storm-0.10.0' '-Dstorm.conf.file='
'-Dstorm.options='
'-Dstorm.log.dir=/home/fogetti/downloads/apache-storm-0.10.0/logs'
'-Dlogging.sensitivity=S3'
'-Dlog4j.configurationFile=/home/fogetti/downloads/apache-storm-0.10.0/log4j2/worker.xml'
'-Dstorm.id=phish-storm-topology-2-1450237054'
'-Dworker.id=d3819964-7671-447d-8763-827ab5cd6140' '-Dworker.port=6700' '-cp'
'/home/fogetti/downloads/apache-storm-0.10.0/lib/kryo-2.21.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/log4j-over-slf4j-1.6.6.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/log4j-slf4j-impl-2.1.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/minlog-1.2.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/asm-4.0.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/log4j-core-2.1.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/reflectasm-1.07-shaded.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/log4j-api-2.1.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/hadoop-auth-2.4.0.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/slf4j-api-1.7.7.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/clojure-1.6.0.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/servlet-api-2.5.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/disruptor-2.10.4.jar:/home/fogetti/downloads/apache-storm-0.10.0/lib/storm-core-0.10.0.jar:/home/fogetti/downloads/apache-storm-0.10.0/conf:/home/fogetti/downloads/apache-storm-0.10.0/storm-local/supervisor/stormdist/phish-storm-topology-2-1450237054/stormjar.jar'
'backtype.storm.daemon.worker' 'phish-storm-topology-2-1450237054'
'6c528751-1a10-4c33-bd54-a1ec9cb26d86' '6700'
'd3819964-7671-447d-8763-827ab5cd6140'
2015-12-16 12:56:27.888 b.s.config [INFO] SET worker-user
d3819964-7671-447d-8763-827ab5cd6140
2015-12-16 12:56:27.948 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:28.169 b.s.d.supervisor [INFO] Removing code for storm id
phish-storm-topology-2-1450237054
2015-12-16 12:56:28.454 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:28.956 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:29.457 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:29.957 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:30.458 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:30.960 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:31.463 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:31.964 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:32.465 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:32.966 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:33.467 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:33.968 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:34.469 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:34.970 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:35.471 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:35.972 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:36.473 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:36.974 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:37.076 b.s.d.supervisor [INFO] Worker Process
d3819964-7671-447d-8763-827ab5cd6140 exited with code: 13
2015-12-16 12:56:37.475 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:56:37.976 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
...
2015-12-16 12:58:23.196 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:58:23.697 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:58:24.198 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:58:24.699 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:58:25.200 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:58:25.701 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:58:26.202 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:58:26.702 b.s.d.supervisor [INFO]
d3819964-7671-447d-8763-827ab5cd6140 still hasn't started
2015-12-16 12:58:27.203 b.s.d.supervisor [INFO] Worker
d3819964-7671-447d-8763-827ab5cd6140 failed to start
2015-12-16 12:58:27.206 b.s.d.supervisor [INFO] Shutting down and clearing
state for id d3819964-7671-447d-8763-827ab5cd6140. Current supervisor time:
1450238307. State: :not-started, Heartbeat: nil
2015-12-16 12:58:27.207 b.s.d.supervisor [INFO] Shutting down
6c528751-1a10-4c33-bd54-a1ec9cb26d86:d3819964-7671-447d-8763-827ab5cd6140
2015-12-16 12:58:27.208 b.s.config [INFO] GET worker-user
d3819964-7671-447d-8763-827ab5cd6140
2015-12-16 12:58:27.259 b.s.util [INFO] Error when trying to kill 5076. Process
is probably already dead.
2015-12-16 12:58:27.260 b.s.d.supervisor [INFO] Sleep 1 seconds for execution
of cleanup threads on worker.
2015-12-16 12:58:28.266 b.s.util [INFO] Error when trying to kill 5076. Process
is probably already dead.
2015-12-16 12:58:28.275 b.s.config [INFO] REMOVE worker-user
d3819964-7671-447d-8763-827ab5cd6140
2015-12-16 12:58:28.276 b.s.d.supervisor [INFO] Shut down
6c528751-1a10-4c33-bd54-a1ec9cb26d86:d3819964-7671-447d-8763-827ab5cd6140
{code}
And this is what I see in the worker log:
{code}
2015-12-16 12:56:36.702 b.s.u.Utils [INFO] Using defaults.yaml from resources
2015-12-16 12:56:36.769 b.s.u.Utils [INFO] Using storm.yaml from resources
2015-12-16 12:56:36.921 b.s.d.worker [INFO] Launching worker for
phish-storm-topology-2-1450237054 on 6c528751-1a10-4c33-bd54-a1ec9cb26d86:6700
with id d3819964-7671-447d-8763-827ab5cd6140 and conf
{"topology.builtin.metrics.bucket.size.secs" 60, "nimbus.childopts"
"-Xmx1024m", "ui.filter.params" nil, "storm.cluster.mode" "distributed",
"storm.messaging.netty.client_worker_threads" 1,
"supervisor.run.worker.as.user" false, "topology.max.task.parallelism" nil,
"zmq.threads" 1, "storm.group.mapping.service"
"backtype.storm.security.auth.ShellBasedGroupsMapping",
"transactional.zookeeper.root" "/transactional",
"topology.sleep.spout.wait.strategy.time.ms" 1, "drpc.invocations.port" 3773,
"topology.multilang.serializer" "backtype.storm.multilang.JsonSerializer",
"storm.messaging.netty.server_worker_threads" 1,
"topology.max.error.report.per.interval" 5, "storm.thrift.transport"
"backtype.storm.security.auth.SimpleTransportPlugin", "zmq.hwm" 0,
"storm.principal.tolocal"
"backtype.storm.security.auth.DefaultPrincipalToLocal",
"supervisor.worker.shutdown.sleep.secs" 1, "storm.zookeeper.retry.times" 5,
"ui.actions.enabled" true, "zmq.linger.millis" 5000, "supervisor.enable" true,
"topology.stats.sample.rate" 0.05, "storm.messaging.netty.min_wait_ms" 100,
"storm.zookeeper.port" 2181, "supervisor.heartbeat.frequency.secs" 5,
"topology.enable.message.timeouts" true, "drpc.worker.threads" 64,
"drpc.queue.size" 128, "drpc.https.keystore.password" "", "logviewer.port"
8000, "nimbus.reassign" true, "topology.executor.send.buffer.size" 1024,
"topology.spout.wait.strategy" "backtype.storm.spout.SleepSpoutWaitStrategy",
"ui.host" "0.0.0.0", "storm.nimbus.retry.interval.millis" 2000,
"nimbus.inbox.jar.expiration.secs" 3600, "dev.zookeeper.path"
"/tmp/dev-storm-zookeeper", "topology.acker.executors" nil,
"topology.fall.back.on.java.serialization" true, "storm.zookeeper.servers"
["dimebag" "petrucci" "hetfield"], "nimbus.thrift.threads" 64,
"logviewer.cleanup.age.mins" 10080, "topology.worker.childopts" nil,
"topology.classpath" nil, "supervisor.monitor.frequency.secs" 3,
"nimbus.credential.renewers.freq.secs" 600,
"topology.skip.missing.kryo.registrations" false,
"drpc.authorizer.acl.filename" "drpc-auth-acl.yaml",
"storm.group.mapping.service.cache.duration.secs" 120,
"topology.testing.always.try.serialize" false, "nimbus.monitor.freq.secs" 10,
"supervisor.supervisors" [], "topology.tasks" nil,
"topology.bolts.outgoing.overflow.buffer.enable" false,
"storm.messaging.netty.socket.backlog" 500, "topology.workers" 1,
"storm.local.dir" "/home/fogetti/downloads/apache-storm-0.10.0/storm-local",
"worker.childopts" "-Xmx768m", "storm.auth.simple-white-list.users" [],
"topology.message.timeout.secs" 30,
"topology.state.synchronization.timeout.secs" 60, "topology.tuple.serializer"
"backtype.storm.serialization.types.ListDelegateSerializer",
"supervisor.supervisors.commands" [], "logviewer.childopts" "-Xmx128m",
"topology.environment" nil, "topology.debug" false,
"storm.messaging.netty.max_retries" 300, "ui.childopts" "-Xmx768m",
"storm.zookeeper.session.timeout" 20000, "drpc.childopts" "-Xmx768m",
"drpc.http.creds.plugin"
"backtype.storm.security.auth.DefaultHttpCredentialsPlugin",
"storm.zookeeper.connection.timeout" 15000, "storm.zookeeper.auth.user" nil,
"storm.meta.serialization.delegate"
"backtype.storm.serialization.GzipThriftSerializationDelegate",
"topology.max.spout.pending" nil, "nimbus.supervisor.timeout.secs" 60,
"nimbus.task.timeout.secs" 30, "drpc.port" 3772,
"storm.zookeeper.retry.intervalceiling.millis" 30000, "nimbus.thrift.port"
6627, "storm.auth.simple-acl.admins" [], "storm.nimbus.retry.times" 5,
"supervisor.worker.start.timeout.secs" 120, "storm.zookeeper.retry.interval"
1000, "logs.users" nil, "transactional.zookeeper.port" nil,
"drpc.max_buffer_size" 1048576, "task.credentials.poll.secs" 30,
"drpc.https.keystore.type" "JKS", "topology.worker.receiver.thread.count" 1,
"supervisor.slots.ports" [6700 6701], "topology.transfer.buffer.size" 1024,
"topology.worker.shared.thread.pool.size" 4, "drpc.authorizer.acl.strict"
false, "nimbus.file.copy.expiration.secs" 600,
"topology.executor.receive.buffer.size" 1024, "nimbus.task.launch.secs" 120,
"storm.local.mode.zmq" false, "storm.messaging.netty.buffer_size" 5242880,
"worker.heartbeat.frequency.secs" 1, "ui.http.creds.plugin"
"backtype.storm.security.auth.DefaultHttpCredentialsPlugin",
"storm.zookeeper.root" "/storm", "topology.tick.tuple.freq.secs" nil,
"drpc.https.port" -1, "task.refresh.poll.secs" 10,
"task.heartbeat.frequency.secs" 3, "storm.messaging.netty.max_wait_ms" 1000,
"nimbus.impersonation.authorizer"
"backtype.storm.security.auth.authorizer.ImpersonationAuthorizer",
"drpc.http.port" 3774, "topology.error.throttle.interval.secs" 10,
"storm.messaging.transport" "backtype.storm.messaging.netty.Context",
"storm.messaging.netty.authentication" false, "topology.kryo.factory"
"backtype.storm.serialization.DefaultKryoFactory", "worker.gc.childopts" "",
"nimbus.topology.validator" "backtype.storm.nimbus.DefaultTopologyValidator",
"nimbus.cleanup.inbox.freq.secs" 600, "ui.users" nil,
"transactional.zookeeper.servers" nil, "supervisor.worker.timeout.secs" 30,
"storm.zookeeper.auth.password" nil, "supervisor.childopts" "-Xmx256m",
"ui.filter" nil, "ui.header.buffer.bytes" 4096,
"topology.disruptor.wait.timeout.millis" 1000,
"storm.nimbus.retry.intervalceiling.millis" 60000,
"topology.trident.batch.emit.interval.millis" 500,
"topology.disruptor.wait.strategy" "com.lmax.disruptor.BlockingWaitStrategy",
"storm.auth.simple-acl.users" [], "drpc.invocations.threads" 64,
"java.library.path" "/usr", "ui.port" 8080,
"storm.messaging.netty.transfer.batch.size" 262144, "logviewer.appender.name"
"A1", "nimbus.thrift.max_buffer_size" 1048576, "nimbus.host" "dimebag",
"storm.auth.simple-acl.users.commands" [], "drpc.request.timeout.secs" 600}
2015-12-16 12:56:36.947 b.s.util [DEBUG] Touching file at
/home/fogetti/downloads/apache-storm-0.10.0/storm-local/workers/d3819964-7671-447d-8763-827ab5cd6140/pids/5076
2015-12-16 12:56:36.962 b.s.d.worker [ERROR] Error on initialization of server mk-worker
java.io.FileNotFoundException: File '/home/fogetti/downloads/apache-storm-0.10.0/storm-local/supervisor/stormdist/phish-storm-topology-2-1450237054/stormconf.ser' does not exist
    at org.apache.storm.shade.org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:299) ~[storm-core-0.10.0.jar:0.10.0]
    at org.apache.storm.shade.org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1763) ~[storm-core-0.10.0.jar:0.10.0]
    at backtype.storm.config$read_supervisor_storm_conf.invoke(config.clj:222) ~[storm-core-0.10.0.jar:0.10.0]
    at backtype.storm.daemon.worker$fn__7098$exec_fn__1236__auto____7099.invoke(worker.clj:418) ~[storm-core-0.10.0.jar:0.10.0]
    at clojure.lang.AFn.applyToHelper(AFn.java:178) ~[clojure-1.6.0.jar:?]
    at clojure.lang.AFn.applyTo(AFn.java:144) ~[clojure-1.6.0.jar:?]
    at clojure.core$apply.invoke(core.clj:624) ~[clojure-1.6.0.jar:?]
    at backtype.storm.daemon.worker$fn__7098$mk_worker__7175.doInvoke(worker.clj:409) [storm-core-0.10.0.jar:0.10.0]
    at clojure.lang.RestFn.invoke(RestFn.java:512) [clojure-1.6.0.jar:?]
    at backtype.storm.daemon.worker$_main.invoke(worker.clj:542) [storm-core-0.10.0.jar:0.10.0]
    at clojure.lang.AFn.applyToHelper(AFn.java:165) [clojure-1.6.0.jar:?]
    at clojure.lang.AFn.applyTo(AFn.java:144) [clojure-1.6.0.jar:?]
    at backtype.storm.daemon.worker.main(Unknown Source) [storm-core-0.10.0.jar:0.10.0]
2015-12-16 12:56:36.984 b.s.util [ERROR] Halting process: ("Error on initialization")
java.lang.RuntimeException: ("Error on initialization")
    at backtype.storm.util$exit_process_BANG_.doInvoke(util.clj:336) [storm-core-0.10.0.jar:0.10.0]
    at clojure.lang.RestFn.invoke(RestFn.java:423) [clojure-1.6.0.jar:?]
    at backtype.storm.daemon.worker$fn__7098$mk_worker__7175.doInvoke(worker.clj:409) [storm-core-0.10.0.jar:0.10.0]
    at clojure.lang.RestFn.invoke(RestFn.java:512) [clojure-1.6.0.jar:?]
    at backtype.storm.daemon.worker$_main.invoke(worker.clj:542) [storm-core-0.10.0.jar:0.10.0]
    at clojure.lang.AFn.applyToHelper(AFn.java:165) [clojure-1.6.0.jar:?]
    at clojure.lang.AFn.applyTo(AFn.java:144) [clojure-1.6.0.jar:?]
    at backtype.storm.daemon.worker.main(Unknown Source) [storm-core-0.10.0.jar:0.10.0]
{code}
And this is what I see in the nimbus log:
{code}
2015-12-16 12:52:26.929 b.s.d.nimbus [INFO] Delaying event :do-rebalance for 10
secs for phish-storm-topology-2-1450237054
2015-12-16 12:52:27.257 b.s.d.nimbus [INFO] [req 29] Access from: principal:
op:getTopologyConf
2015-12-16 12:52:27.290 b.s.d.nimbus [INFO] [req 28] Access from: principal:
op:getTopologyInfo
2015-12-16 12:52:27.376 b.s.d.nimbus [INFO] [req 34] Access from: principal:
op:getTopology
2015-12-16 12:52:27.378 b.s.d.nimbus [INFO] [req 30] Access from: principal:
op:getTopologyConf
2015-12-16 12:52:28.505 b.s.d.nimbus [INFO] [req 32] Access from: principal:
op:getClusterInfo
2015-12-16 12:52:28.510 b.s.d.nimbus [INFO] [req 31] Access from: principal:
op:getClusterInfo
2015-12-16 12:52:28.524 b.s.d.nimbus [INFO] [req 38] Access from: principal:
op:getClusterInfo
2015-12-16 12:52:28.529 b.s.d.nimbus [INFO] [req 33] Access from: principal:
op:getNimbusConf
2015-12-16 12:52:37.055 b.s.s.EvenScheduler [INFO] Available slots:
(["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700]
["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6701]
["6c528751-1a10-4c33-bd54-a1ec9cb26d86" 6700] ["6c528751-1a10-4c33-bd54-a1
ec9cb26d86" 6701])
2015-12-16 12:52:37.057 b.s.d.nimbus [INFO] Setting new assignment for topology
id phish-storm-topology-2-1450237054:
#backtype.storm.daemon.common.Assignment{:master-code-dir
"/home/fogetti/downloads/apache-storm-0.10.0/storm-local/nimbus/stormdist/phish-storm-topology-2-1450237054",
:node->host {"6c528751-1a10-4c33-bd54-a1ec9cb26d86" "petrucci",
"2c80eadd-a7bd-4470-9ba5-d17be81c94ae" "hetfield"},
:executor->node+port {[8 8] ["6c528751-1a10-4c33-bd54-a1ec9cb26d86" 6700],
[7 7] ["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700],
[5 6] ["6c528751-1a10-4c33-bd54-a1ec9cb26d86" 6700],
[3 4] ["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700],
[9 10] ["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700],
[2 2] ["6c528751-1a10-4c33-bd54-a1ec9cb26d86" 6700],
[1 1] ["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700]},
:executor->start-time-secs {[9 10] 1450237957, [7 7] 1450237957,
[3 4] 1450237957, [1 1] 1450237957, [8 8] 1450237957, [5 6] 1450237957,
[2 2] 1450237957}}
...
2015-12-16 12:54:41.453 b.s.d.nimbus [INFO] Executor
phish-storm-topology-2-1450237054:[8 8] not alive
2015-12-16 12:54:41.453 b.s.d.nimbus [INFO] Executor
phish-storm-topology-2-1450237054:[2 2] not alive
2015-12-16 12:54:41.453 b.s.d.nimbus [INFO] Executor
phish-storm-topology-2-1450237054:[5 6] not alive
2015-12-16 12:54:41.457 b.s.s.EvenScheduler [INFO] Available slots:
(["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6701]
["6c528751-1a10-4c33-bd54-a1ec9cb26d86" 6701])
2015-12-16 12:54:41.457 b.s.d.nimbus [INFO] Reassigning
phish-storm-topology-2-1450237054 to 2 slots
2015-12-16 12:54:41.457 b.s.d.nimbus [INFO] Reassign executors: [[8 8] [5 6] [2 2]]
2015-12-16 12:54:41.462 b.s.d.nimbus [INFO] Setting new assignment for topology
id phish-storm-topology-2-1450237054:
#backtype.storm.daemon.common.Assignment{:master-code-dir
"/home/fogetti/downloads/apache-storm-0.10.0/storm-local/nimbus/stormdist/phish-storm-topology-2-1450237054",
:node->host {"2c80eadd-a7bd-4470-9ba5-d17be81c94ae" "hetfield"},
:executor->node+port {[8 8] ["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6701],
[7 7] ["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700],
[5 6] ["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6701],
[3 4] ["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700],
[9 10] ["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700],
[2 2] ["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6701],
[1 1] ["2c80eadd-a7bd-4470-9ba5-d17be81c94ae" 6700]},
:executor->start-time-secs {[7 7] 1450237957, [1 1] 1450237957,
[8 8] 1450238081, [2 2] 1450238081, [9 10] 1450237957, [3 4] 1450237957,
[5 6] 1450238081}}
{code}
Hope it helps.
> [Storm 0.8.2]: java.io.FileNotFoundException: File '../stormconf.ser' does
> not exist
> ------------------------------------------------------------------------------------
>
> Key: STORM-130
> URL: https://issues.apache.org/jira/browse/STORM-130
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-core
> Affects Versions: 0.9.2-incubating, 0.9.3, 0.9.4
> Reporter: James Xu
> Assignee: Sriharsha Chintalapani
> Priority: Minor
> Fix For: 0.10.0, 0.9.5
>
> Attachments: README.txt, nimbus.log.gz, supervisor.log.2015-08-26,
> supervisor_logs.tar.gz, worker_logs.tar.gz,
> worker_logs_of_kafka_traffic.tar.gz,
> worker_logs_of_zookeeper_traffic_2015-04-11.tar.gz,
> worker_logs_of_zookeeper_traffic_2015-04-12.tar.gz,
> worker_logs_of_zookeeper_traffic_2015-04-13.tar.gz,
> workers_with_stormconf.ser.gz
>
>
> https://github.com/nathanmarz/storm/issues/438
> Hi developers,
> We hit a critical issue deploying a Storm topology to our prod cluster.
> After deploying the topology we got this trace on the workers (Storm
> 0.8.2/zookeeper-3.3.6):
> 2013-01-14 10:57:39 ZooKeeper [INFO] Initiating client connection,
> connectString=zookeeper1.company.com:2181,zookeeper2.company.com:2181,zookeeper3.company.com:2181
> sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@254ba9a2
> 2013-01-14 10:57:39 ClientCnxn [INFO] Opening socket connection to server
> zookeeper1.company.com/10.72.209.112:2181
> 2013-01-14 10:57:39 ClientCnxn [INFO] Socket connection established to
> zookeeper1.company.com/10.72.209.112:2181, initiating session
> 2013-01-14 10:57:39 ClientCnxn [INFO] Session establishment complete on
> server zookeeper1.company.com/10.72.209.112:2181, sessionid =
> 0x13b3e4b5c780239, negotiated timeout = 20000
> 2013-01-14 10:57:39 zookeeper [INFO] Zookeeper state update: :connected:none
> 2013-01-14 10:57:39 ZooKeeper [INFO] Session: 0x13b3e4b5c780239 closed
> 2013-01-14 10:57:39 ClientCnxn [INFO] EventThread shut down
> 2013-01-14 10:57:39 CuratorFrameworkImpl [INFO] Starting
> 2013-01-14 10:57:39 ZooKeeper [INFO] Initiating client connection,
> connectString=zookeeper1.company.com:2181,zookeeper2.company.com:2181,zookeeper3.company.com:2181/storm
> sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@33a998c7
> 2013-01-14 10:57:39 ClientCnxn [INFO] Opening socket connection to server
> zookeeper1.company.com/10.72.209.112:2181
> 2013-01-14 10:57:39 ClientCnxn [INFO] Socket connection established to
> zookeeper1.company.com/10.72.209.112:2181, initiating session
> 2013-01-14 10:57:39 ClientCnxn [INFO] Session establishment complete on
> server zookeeper1.company.com/10.72.209.112:2181, sessionid =
> 0x13b3e4b5c78023a, negotiated timeout = 20000
> 2013-01-14 10:57:39 worker [ERROR] Error on initialization of server mk-worker
> java.io.FileNotFoundException: File
> '/tmp/storm/supervisor/stormdist/normalization-prod-1-1358161053/stormconf.ser'
> does not exist
> at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:137)
> at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1135)
> at backtype.storm.config$read_supervisor_storm_conf.invoke(config.clj:138)
> at backtype.storm.daemon.worker$worker_data.invoke(worker.clj:146)
> at backtype.storm.daemon.worker$fn__4348$exec_fn__1228__auto____4349.invoke(worker.clj:332)
> at clojure.lang.AFn.applyToHelper(AFn.java:185)
> at clojure.lang.AFn.applyTo(AFn.java:151)
> at clojure.core$apply.invoke(core.clj:601)
> at backtype.storm.daemon.worker$fn__4348$mk_worker__4404.doInvoke(worker.clj:323)
> at clojure.lang.RestFn.invoke(RestFn.java:512)
> at backtype.storm.daemon.worker$_main.invoke(worker.clj:433)
> at clojure.lang.AFn.applyToHelper(AFn.java:172)
> at clojure.lang.AFn.applyTo(AFn.java:151)
> at backtype.storm.daemon.worker.main(Unknown Source)
> 2013-01-14 10:57:39 util [INFO] Halting process: ("Error on initialization")
> Supervisor trace:
> 2013-01-14 10:59:01 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26
> still hasn't started
> 2013-01-14 10:59:02 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26
> still hasn't started
> .......
> 2013-01-14 10:59:34 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26
> still hasn't started
> 2013-01-14 10:59:35 supervisor [INFO] Worker
> d6735377-f0d6-4247-9f35-c8620e2b0e26 failed to start
> 2013-01-14 10:59:35 supervisor [INFO] Worker
> 234264c6-d9d6-4e8a-ab0a-8926bdd6b536 failed to start
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down and clearing state for id
> 234264c6-d9d6-4e8a-ab0a-8926bdd6b536. Current supervisor time: 1358161175.
> State: :disallowed, Heartbeat: nil
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down
> d5c3235f-5880-4be8-a759-5654b3df6a27:234264c6-d9d6-4e8a-ab0a-8926bdd6b536
> 2013-01-14 10:59:35 util [INFO] Error when trying to kill 4819. Process is
> probably already dead.
> 2013-01-14 10:59:35 supervisor [INFO] Shut down
> d5c3235f-5880-4be8-a759-5654b3df6a27:234264c6-d9d6-4e8a-ab0a-8926bdd6b536
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down and clearing state for id
> d6735377-f0d6-4247-9f35-c8620e2b0e26. Current supervisor time: 1358161175.
> State: :disallowed, Heartbeat: nil
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down
> d5c3235f-5880-4be8-a759-5654b3df6a27:d6735377-f0d6-4247-9f35-c8620e2b0e26
> 2013-01-14 10:59:35 util [INFO] Error when trying to kill 4809. Process is
> probably already dead.
> 2013-01-14 10:59:35 supervisor [INFO] Shut down
> d5c3235f-5880-4be8-a759-5654b3df6a27:d6735377-f0d6-4247-9f35-c8620e2b0e26
> Thanks!
> Oleg M.
> ----------
> xiaokang: We also encountered this problem, and the DEBUG log showed that
> storm-code-map is null while new-assignment is not null. It may be that
> new-assignment is read from zk after storm-code-map. So we moved the
> storm-code-map read after all-assignment, and the problem disappeared (a
> sketch of the reordering follows the snippet below).
> (defn mk-synchronize-supervisor [supervisor sync-processes event-manager
>                                  processes-event-manager]
>   (fn this []
>     (let [conf (:conf supervisor)
>           storm-cluster-state (:storm-cluster-state supervisor)
>           ^ISupervisor isupervisor (:isupervisor supervisor)
>           ^LocalState local-state (:local-state supervisor)
>           sync-callback (fn [& ignored] (.add event-manager this))
>           storm-code-map (read-storm-code-locations storm-cluster-state
>                                                     sync-callback)
>           downloaded-storm-ids (set (read-downloaded-storm-ids conf))
>           all-assignment (read-assignments
>                           storm-cluster-state
>                           (:supervisor-id supervisor)
>                           sync-callback)
>           new-assignment (->> all-assignment
>                               (filter-key #(.confirmAssigned isupervisor %)))
>           assigned-storm-ids (assigned-storm-ids-from-port-assignments
>                               new-assignment)
>           existing-assignment (.get local-state LS-LOCAL-ASSIGNMENTS)]
>       (log-debug "Synchronizing supervisor")
>       (log-debug "Storm code map: " storm-code-map)
>       (log-debug "Downloaded storm ids: " downloaded-storm-ids)
>       (log-debug "All assignment: " all-assignment)
>       (log-debug "New assignment: " new-assignment)
> ----------
> NJtwentyone: I ran into a similar problem once, though I haven't had it
> lately. I was going to investigate more before creating a post, but... There
> are two schedule-recurring calls of interest in the function
> (mk-supervisor):
> (schedule-recurring (:timer supervisor) 0 10 (fn ))
> (schedule-recurring (:timer supervisor)
>                     0
>                     (conf SUPERVISOR-MONITOR-FREQUENCY-SECS)
>                     (fn )))
> So function-1 (synchronize-supervisor) will eventually remove that dir
> [../stormconf.ser], and function-2 (sync-processes) will eventually hang
> wanting to create workers, repeatedly saying
> 2013-01-13 22:13:13 b.s.d.supervisor [INFO]
> 6f73facd-8722-4b83-959c-a7b396c61224 still hasn't started
> (A sketch of the two timers follows below.) I'm using the distributed setup
> (Storm 0.8.2/ZooKeeper 3.4.5) on my Mac and an HP, running
> storm.starter.WordCountTopology.
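> To see the race, here is a hedged sketch of those two calls with the elided
> (fn ) bodies named; the names come from the comments above, and the bodies
> are placeholders, not the real supervisor.clj code:
> {code}
> ;; Timer 1: every 10 seconds, download code for newly assigned topologies
> ;; and remove code for topologies that are no longer assigned here.
> (schedule-recurring (:timer supervisor) 0 10
>                     (fn [] (synchronize-supervisor)))
> ;; Timer 2: every SUPERVISOR-MONITOR-FREQUENCY-SECS, launch or kill worker
> ;; processes based on the local assignment state written by timer 1.
> (schedule-recurring (:timer supervisor)
>                     0
>                     (conf SUPERVISOR-MONITOR-FREQUENCY-SECS)
>                     (fn [] (sync-processes)))
> ;; The two timers run independently, so timer 1 can delete stormconf.ser
> ;; after timer 2 has launched a worker but before that worker reads it.
> {code}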
> ----------
> miggi: Well, this problem was resolved only after installing a new instance
> of ZK (3.3.6). We had other issues deploying topologies on version 0.8.1, so
> it was decided to use a new ZK instance.
> Btw: removing the Storm data folders didn't help.
> ----------
> nathanmarz: For those of you hitting the problem, it would be really useful
> if you could turn on DEBUG logging on the supervisor and show those logs when
> the problem is hit.
> ----------
> miggi: After a few months of stable work, we ran into the same issue again.
> Logs ("topology.debug: true")
> Supervisor logs: http://cl.ly/0p2W3N0A2V3S
> Worker logs: http://cl.ly/2w1j1r0a3X46
> Cleaning/restarting the ZK/Nimbus/Worker components didn't help.
> We're down now: we haven't tried the solution proposed by xiaokang, but it
> will probably be our next step.
> Any fresh ideas?
> ----------
> devoncrouse: +1 - ran into this twice now. We've "reset" things by clearing
> out the Storm state in Zk, wiping out the Storm data directory, and
> restarting the topology, but it happens again in time.
> Storm 0.9.0-wip16
> Zk 3.3.6
> ----------
> d2r: +1 - seen this happen many times, with the same work-around.
> Most recently on Storm 0.8.2-wip22.
> I'll try to collect debug logs from the supervisor and have a look.
> ----------
> miggi: Hey guys,
> try restarting your cluster and ensure that all previous workers (Java
> processes) were killed after this operation.
> Also, we've updated the cluster to Storm 0.8.3-wip1, which has a possible
> fix.
> Hope it helps,
> Oleg M.
> ----------
> devoncrouse: A simple restart doesn't cut it; you have to totally clear out
> all Storm state in Zk and the data directory (otherwise the cluster quickly
> becomes unbalanced, with the same errors in the logs) - and again, we're on
> 0.9.0-wip16.
> ----------
> xiaokang: We encountered the same problem in a production cluster. The log
> showed that a worker died and the supervisor and nimbus both noticed at the
> same time. Then nimbus assigned the worker to another host, and the
> supervisor's sync-supervisor deleted the topology's code while sync-processes
> tried to restart the dead, timed-out worker.
> Our solution was to change the supervisor's worker timeout from 30 to 5
> seconds while nimbus remains at 30. It works. (A config sketch follows
> below.)
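> For anyone who wants to try that workaround, a sketch of the storm.yaml
> change (both keys appear in the worker conf dump earlier in this thread;
> the value 5 is xiaokang's choice, not a default):
> {code}
> # Supervisor declares a silent worker dead after 5s instead of 30s,
> # so it restarts the worker well before nimbus reassigns the executors.
> supervisor.worker.timeout.secs: 5
> # Nimbus keeps its default, per xiaokang's suggestion.
> nimbus.task.timeout.secs: 30
> {code}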
> ----------
> d2r: We have tried the same work-around from @xiaokang, but we still see the
> issue.
> ----------
> nathanmarz: @d2r This is fixed in the 0.8.3 wip, please try that out and let
> me know if it fixes the issue for you.
> ----------
> d2r: @nathanmarz Yes, pulling in the fix to supervisor.clj from 0.8.3 did
> resolve the issue for us. Thanks!
> ----------
> devoncrouse: Is this fix going to be applied to the newer builds (e.g. 0.9.0)
> as well?
> d5207b5
> ----------
> viceyang: We encountered the same problem; our version is 0.8.3. I read the
> supervisor code, and it seems this is not a bug. The key reason is that
> "synchronize-supervisor" (the thread responsible for downloading and
> removing files) and "sync-processes" (the thread responsible for starting
> worker processes) run asynchronously.
> Consider this case: synchronize-supervisor reads the assignment information
> from zk, downloads the necessary files from nimbus, and writes the local
> state. In another thread, the sync-processes function reads the local state
> to launch the worker process. If, before the worker process has started,
> synchronize-supervisor is called again and the topology's assignment has
> changed (caused by a rebalance, a worker timeout, etc.) so that the worker
> assigned to this supervisor has moved to another supervisor, then
> synchronize-supervisor removes the now-unnecessary files (the jar file, the
> ser file, etc.). After this, the worker launched by "sync-processes" finds
> that the ser file does not exist, and this issue occurs.
> Though an exception occurs in this case, the logic seems right. The
> exception can't be completely eliminated unless the asynchronous threading
> is changed.
> There are ways to reduce how often the exception occurs: @xiaokang mentioned
> "change supervisor's worker timeout from 30 to 5 seconds and nimbus remain
> 30", which reduces nimbus reassigning the topology. Another way is to change
> the "synchronize-supervisor" thread's loop period to something longer than
> the default 10 seconds, such as 30 seconds, as in the sketch below.
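> A hedged sketch of that second mitigation (the 10 is a hardcoded literal in
> mk-supervisor in these versions, so this is a source change rather than a
> storm.yaml setting; the (fn ) body is a placeholder as before):
> {code}
> ;; Sketch: run the download/cleanup loop every 30s instead of 10s, so code
> ;; removal is less likely to land between worker launch and the worker
> ;; reading stormconf.ser.
> (schedule-recurring (:timer supervisor) 0 30          ; was 10
>                     (fn [] (synchronize-supervisor)))
> {code}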
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)