[
https://issues.apache.org/jira/browse/STORM-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331822#comment-15331822
]
Kevin Conaway commented on STORM-1879:
--------------------------------------
[~nico.meyer] we're seeing the same thing. Specifically:
{quote}
2016-06-14 18:31:04.465 o.a.s.d.supervisor [INFO] Shutting down
ee56fb9d-2657-4d3f-b52a-1ae4abae85f7:
{quote}
It's shutting down a worker without a worker ID. Below are the logs from one of
our supervisors:
{quote}
2016-06-15 14:18:58.349 o.a.s.d.supervisor [INFO] Shutting down
0274fa9c-c271-4a76-928e-28955db4ee34:
2016-06-15 14:18:58.349 o.a.s.config [INFO] GET worker-user
2016-06-15 14:18:58.350 o.a.s.config [WARN] Failed to get worker user for .
#error {
:cause /var/storm/storm-local/workers-users (Is a directory)
:via
[{:type java.io.FileNotFoundException
:message /var/storm/storm-local/workers-users (Is a directory)
:at [java.io.FileInputStream open0 FileInputStream.java -2]}]
:trace
[[java.io.FileInputStream open0 FileInputStream.java -2]
[java.io.FileInputStream open FileInputStream.java 195]
[java.io.FileInputStream <init> FileInputStream.java 138]
[clojure.java.io$fn__9189 invoke io.clj 229]
[clojure.java.io$fn__9102$G__9095__9109 invoke io.clj 69]
[clojure.java.io$fn__9201 invoke io.clj 258]
[clojure.java.io$fn__9102$G__9095__9109 invoke io.clj 69]
[clojure.java.io$fn__9163 invoke io.clj 165]
[clojure.java.io$fn__9115$G__9091__9122 invoke io.clj 69]
[clojure.java.io$reader doInvoke io.clj 102]
[clojure.lang.RestFn invoke RestFn.java 410]
[clojure.lang.AFn applyToHelper AFn.java 154]
[clojure.lang.RestFn applyTo RestFn.java 132]
[clojure.core$apply invoke core.clj 632]
[clojure.core$slurp doInvoke core.clj 6653]
[clojure.lang.RestFn invoke RestFn.java 410]
[org.apache.storm.config$get_worker_user invoke config.clj 239]
[org.apache.storm.daemon.supervisor$shutdown_worker invoke supervisor.clj 281]
[org.apache.storm.daemon.supervisor$kill_existing_workers_with_change_in_components
invoke supervisor.clj 536]
[org.apache.storm.daemon.supervisor$mk_synchronize_supervisor$this__9078
invoke supervisor.clj 595]
[org.apache.storm.event$event_manager$fn__8630 invoke event.clj 40]
[clojure.lang.AFn run AFn.java 22]
[java.lang.Thread run Thread.java 745]]}
2016-06-15 14:18:58.362 o.a.s.config [INFO] REMOVE worker-user
2016-06-15 14:18:58.362 o.a.s.d.supervisor [INFO] Shut down
0274fa9c-c271-4a76-928e-28955db4ee34:
{quote}
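The "(Is a directory)" cause above is what you would expect if the worker-user path is built by joining the workers-users directory with the worker ID and the ID is empty. The following is a minimal Python sketch of that effect, not Storm's actual code; the helper name `describe_worker_user_lookup` and the join logic are assumptions based only on the paths shown in these logs.

```python
import os
import tempfile


def describe_worker_user_lookup(users_dir, worker_id):
    """Classify what a lookup of <users_dir>/<worker_id> would hit.

    Hypothetical helper: assumes the worker-user file lives directly
    under users_dir and is named after the worker ID.
    """
    path = os.path.join(users_dir, worker_id)
    if os.path.isdir(path):
        # An empty worker ID resolves to the workers-users directory itself.
        return "is a directory"
    if not os.path.exists(path):
        # The worker-user file was never written or is already gone.
        return "no such file"
    with open(path) as f:
        return "user: " + f.read().strip()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as users_dir:
        # One worker with a recorded user, for contrast.
        with open(os.path.join(users_dir, "worker-a"), "w") as f:
            f.write("storm\n")
        for worker_id in ("", "worker-b", "worker-a"):
            print(repr(worker_id), "->",
                  describe_worker_user_lookup(users_dir, worker_id))
```

The empty ID maps to the first case in these logs, and a worker whose file is missing maps to the "No such file or directory" case.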
On another supervisor it's a different error:
{quote}
2016-06-15 14:17:45.472 o.a.s.d.supervisor [INFO] Worker
4cee7f5c-2c87-48af-a44d-c1568e472f12 failed to start
2016-06-15 14:17:45.473 o.a.s.d.supervisor [INFO] Worker
96f8b5d5-a4e0-4fe2-90a8-7de85a5dd993 failed to start
2016-06-15 14:17:45.504 o.a.s.d.supervisor [INFO] Shutting down and clearing
state for id 414f723a-683c-4fd8-9b57-b8742a2ddade. Current supervisor time:
1466000265. State: :disallowed, Heartbeat: {:time-secs 1466000264, :storm-id
"<snipped>", :executors [[12 12] [54 54] [156 156] [42 42] [72 72] [24 24] [144
144] [162 162] [186 186] [66 66] [120 120] [102 102] [18 18] [6 6] [96 96] [150
150] [48 48] [30 30] [-1 -1] [114 114] [84 84] [90 90] [60 60] [126 126] [138
138] [36 36] [108 108] [180 180] [132 132] [168 168] [78 78] [174 174]], :port
6701}
2016-06-15 14:17:45.504 o.a.s.d.supervisor [INFO] Shutting down
0274fa9c-c271-4a76-928e-28955db4ee34:414f723a-683c-4fd8-9b57-b8742a2ddade
2016-06-15 14:17:45.505 o.a.s.config [INFO] GET worker-user
414f723a-683c-4fd8-9b57-b8742a2ddade
2016-06-15 14:17:45.507 o.a.s.config [WARN] Failed to get worker user for
414f723a-683c-4fd8-9b57-b8742a2ddade. #error {
:cause
/var/storm/storm-local/workers-users/414f723a-683c-4fd8-9b57-b8742a2ddade (No
such file or directory)
:via
[{:type java.io.FileNotFoundException
:message
/var/storm/storm-local/workers-users/414f723a-683c-4fd8-9b57-b8742a2ddade (No
such file or directory)
:at [java.io.FileInputStream open0 FileInputStream.java -2]}]
:trace
[[java.io.FileInputStream open0 FileInputStream.java -2]
[java.io.FileInputStream open FileInputStream.java 195]
[java.io.FileInputStream <init> FileInputStream.java 138]
[clojure.java.io$fn__9189 invoke io.clj 229]
[clojure.java.io$fn__9102$G__9095__9109 invoke io.clj 69]
[clojure.java.io$fn__9201 invoke io.clj 258]
[clojure.java.io$fn__9102$G__9095__9109 invoke io.clj 69]
[clojure.java.io$fn__9163 invoke io.clj 165]
[clojure.java.io$fn__9115$G__9091__9122 invoke io.clj 69]
[clojure.java.io$reader doInvoke io.clj 102]
[clojure.lang.RestFn invoke RestFn.java 410]
[clojure.lang.AFn applyToHelper AFn.java 154]
[clojure.lang.RestFn applyTo RestFn.java 132]
[clojure.core$apply invoke core.clj 632]
[clojure.core$slurp doInvoke core.clj 6653]
[clojure.lang.RestFn invoke RestFn.java 410]
[org.apache.storm.config$get_worker_user invoke config.clj 239]
[org.apache.storm.daemon.supervisor$shutdown_worker invoke supervisor.clj 281]
[org.apache.storm.daemon.supervisor$sync_processes invoke supervisor.clj 427]
[clojure.core$partial$fn__4527 invoke core.clj 2492]
[org.apache.storm.event$event_manager$fn__8630 invoke event.clj 40]
[clojure.lang.AFn run AFn.java 22]
[java.lang.Thread run Thread.java 745]]}
2016-06-15 14:17:45.517 o.a.s.config [INFO] REMOVE worker-user
414f723a-683c-4fd8-9b57-b8742a2ddade
2016-06-15 14:17:45.517 o.a.s.d.supervisor [INFO] Shut down
0274fa9c-c271-4a76-928e-28955db4ee34:414f723a-683c-4fd8-9b57-b8742a2ddade
2016-06-15 14:17:45.518 o.a.s.d.supervisor [INFO] Shutting down and clearing
state for id 2eb07a90-a8b7-4a81-929c-93d6edb3c2ef. Current supervisor time:
1466000265. State: :disallowed, Heartbeat: {:time-secs 1466000265, :storm-id
"<snipped>", :executors [[41 41] [125 125] [137 137] [53 53] [65 65] [101 101]
[149 149] [161 161] [-1 -1] [5 5] [29 29] [89 89] [173 173] [77 77] [185 185]
[113 113] [17 17]], :port 6700}
2016-06-15 14:17:45.518 o.a.s.d.supervisor [INFO] Shutting down
0274fa9c-c271-4a76-928e-28955db4ee34:2eb07a90-a8b7-4a81-929c-93d6edb3c2ef
2016-06-15 14:17:45.518 o.a.s.config [INFO] GET worker-user
2eb07a90-a8b7-4a81-929c-93d6edb3c2ef
2016-06-15 14:17:45.519 o.a.s.config [WARN] Failed to get worker user for
2eb07a90-a8b7-4a81-929c-93d6edb3c2ef. #error {
:cause
/var/storm/storm-local/workers-users/2eb07a90-a8b7-4a81-929c-93d6edb3c2ef (No
such file or directory)
:via
[{:type java.io.FileNotFoundException
:message
/var/storm/storm-local/workers-users/2eb07a90-a8b7-4a81-929c-93d6edb3c2ef (No
such file or directory)
:at [java.io.FileInputStream open0 FileInputStream.java -2]}]
:trace
[[java.io.FileInputStream open0 FileInputStream.java -2]
[java.io.FileInputStream open FileInputStream.java 195]
[java.io.FileInputStream <init> FileInputStream.java 138]
[clojure.java.io$fn__9189 invoke io.clj 229]
[clojure.java.io$fn__9102$G__9095__9109 invoke io.clj 69]
[clojure.java.io$fn__9201 invoke io.clj 258]
[clojure.java.io$fn__9102$G__9095__9109 invoke io.clj 69]
[clojure.java.io$fn__9163 invoke io.clj 165]
[clojure.java.io$fn__9115$G__9091__9122 invoke io.clj 69]
[clojure.java.io$reader doInvoke io.clj 102]
[clojure.lang.RestFn invoke RestFn.java 410]
[clojure.lang.AFn applyToHelper AFn.java 154]
[clojure.lang.RestFn applyTo RestFn.java 132]
[clojure.core$apply invoke core.clj 632]
[clojure.core$slurp doInvoke core.clj 6653]
[clojure.lang.RestFn invoke RestFn.java 410]
[org.apache.storm.config$get_worker_user invoke config.clj 239]
[org.apache.storm.daemon.supervisor$shutdown_worker invoke supervisor.clj 281]
[org.apache.storm.daemon.supervisor$sync_processes invoke supervisor.clj 427]
[clojure.core$partial$fn__4527 invoke core.clj 2492]
[org.apache.storm.event$event_manager$fn__8630 invoke event.clj 40]
[clojure.lang.AFn run AFn.java 22]
[java.lang.Thread run Thread.java 745]]}
2016-06-15 14:17:45.525 o.a.s.config [INFO] REMOVE worker-user
2eb07a90-a8b7-4a81-929c-93d6edb3c2ef
2016-06-15 14:17:45.525 o.a.s.d.supervisor [INFO] Shut down
0274fa9c-c271-4a76-928e-28955db4ee34:2eb07a90-a8b7-4a81-929c-93d6edb3c2ef
2016-06-15 14:17:45.526 o.a.s.d.supervisor [INFO] Shutting down and clearing
state for id 96f8b5d5-a4e0-4fe2-90a8-7de85a5dd993. Current supervisor time:
1466000265. State: :not-started, Heartbeat: nil
2016-06-15 14:17:45.526 o.a.s.d.supervisor [INFO] Shutting down
0274fa9c-c271-4a76-928e-28955db4ee34:96f8b5d5-a4e0-4fe2-90a8-7de85a5dd993
2016-06-15 14:17:45.526 o.a.s.config [INFO] GET worker-user
96f8b5d5-a4e0-4fe2-90a8-7de85a5dd993
2016-06-15 14:17:45.527 o.a.s.config [INFO] REMOVE worker-user
96f8b5d5-a4e0-4fe2-90a8-7de85a5dd993
2016-06-15 14:17:45.527 o.a.s.d.supervisor [INFO] Shut down
0274fa9c-c271-4a76-928e-28955db4ee34:96f8b5d5-a4e0-4fe2-90a8-7de85a5dd993
{quote}
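To check how often a supervisor logs a shutdown with an empty worker ID, as opposed to the supervisor-id:worker-id form in the log above, something like the following Python sketch could be run over a supervisor log. It assumes each log entry sits on a single line (the entries quoted in this message are wrapped by the mailer) and that both IDs are 36-character UUIDs; the function name is hypothetical.

```python
import re

# Matches "Shutting down <supervisor-id>:<worker-id>" where the worker ID
# may be missing. "Shutting down and clearing state ..." lines do not match
# because a UUID must follow "Shutting down ".
SHUTDOWN_RE = re.compile(
    r"o\.a\.s\.d\.supervisor \[INFO\] Shutting down "
    r"(?P<supervisor>[0-9a-f-]{36}):(?P<worker>[0-9a-f-]{36})?\s*$"
)


def empty_worker_shutdowns(lines):
    """Return the log lines that shut down a worker with no worker ID."""
    hits = []
    for line in lines:
        match = SHUTDOWN_RE.search(line)
        if match and match.group("worker") is None:
            hits.append(line.rstrip("\n"))
    return hits
```

For example, `empty_worker_shutdowns(open("supervisor.log"))` would list the affected entries, where `supervisor.log` stands for whichever supervisor log file is being inspected.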
[~kabhwan] have you seen this before?
> Supervisor may not shut down workers cleanly
> --------------------------------------------
>
> Key: STORM-1879
> URL: https://issues.apache.org/jira/browse/STORM-1879
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-core
> Affects Versions: 1.0.1
> Reporter: Stig Rohde Døssing
> Attachments: fix_missing_worker_pid.patch, nimbus-supervisor.zip,
> supervisor.log
>
>
> We've run into a strange issue with a zombie worker process. It looks like
> the worker pid file somehow got deleted without the worker process shutting
> down. This causes the supervisor to try repeatedly to kill the worker
> unsuccessfully, and means multiple workers may be assigned to the same port.
> The worker root folder sticks around because the worker is still heartbeating
> to it.
> It may or may not be related that we've seen Nimbus occasionally enter an
> infinite loop of printing logs similar to the below.
> {code}
> 2016-05-19 14:55:14.196 o.a.s.b.BlobStoreUtils [ERROR] Could not update the
> blob with keyZendeskTicketTopology-5-1463647641-stormconf.ser
> 2016-05-19 14:55:14.210 o.a.s.b.BlobStoreUtils [ERROR] Could not update the
> blob with keyZendeskTicketTopology-5-1463647641-stormcode.ser
> 2016-05-19 14:55:14.218 o.a.s.b.BlobStoreUtils [ERROR] Could not update the
> blob with keyZendeskTicketTopology-5-1463647641-stormconf.ser
> 2016-05-19 14:55:14.256 o.a.s.b.BlobStoreUtils [ERROR] Could not update the
> blob with keyZendeskTicketTopology-5-1463647641-stormcode.ser
> 2016-05-19 14:55:14.273 o.a.s.b.BlobStoreUtils [ERROR] Could not update the
> blob with keyZendeskTicketTopology-5-1463647641-stormcode.ser
> 2016-05-19 14:55:14.316 o.a.s.b.BlobStoreUtils [ERROR] Could not update the
> blob with keyZendeskTicketTopology-5-1463647641-stormconf.ser
> {code}
> This continues until Nimbus is rebooted. We also see repeating blocks
> similar to the logs below.
> {code}
> 2016-06-02 07:45:03.656 o.a.s.d.nimbus [INFO] Cleaning up
> ZendeskTicketTopology-127-1464780171
> 2016-06-02 07:45:04.132 o.a.s.d.nimbus [INFO]
> ExceptionKeyNotFoundException(msg:ZendeskTicketTopology-127-1464780171-stormjar.jar)
> 2016-06-02 07:45:04.144 o.a.s.d.nimbus [INFO]
> ExceptionKeyNotFoundException(msg:ZendeskTicketTopology-127-1464780171-stormconf.ser)
> 2016-06-02 07:45:04.155 o.a.s.d.nimbus [INFO]
> ExceptionKeyNotFoundException(msg:ZendeskTicketTopology-127-1464780171-stormcode.ser)
> {code}