[ 
https://issues.apache.org/jira/browse/STORM-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331822#comment-15331822
 ] 

Kevin Conaway commented on STORM-1879:
--------------------------------------

[~nico.meyer] we're seeing the same thing. Specifically:

{quote}
2016-06-14 18:31:04.465 o.a.s.d.supervisor [INFO] Shutting down 
ee56fb9d-2657-4d3f-b52a-1ae4abae85f7:
{quote}

It's shutting down a worker without a worker ID. Below are the logs from one of 
our supervisors:

{quote}
2016-06-15 14:18:58.349 o.a.s.d.supervisor [INFO] Shutting down 
0274fa9c-c271-4a76-928e-28955db4ee34:
2016-06-15 14:18:58.349 o.a.s.config [INFO] GET worker-user
2016-06-15 14:18:58.350 o.a.s.config [WARN] Failed to get worker user for . 
#error {
 :cause /var/storm/storm-local/workers-users (Is a directory)
 :via
 [{:type java.io.FileNotFoundException
   :message /var/storm/storm-local/workers-users (Is a directory)
   :at [java.io.FileInputStream open0 FileInputStream.java -2]}]
 :trace
 [[java.io.FileInputStream open0 FileInputStream.java -2]
  [java.io.FileInputStream open FileInputStream.java 195]
  [java.io.FileInputStream <init> FileInputStream.java 138]
  [clojure.java.io$fn__9189 invoke io.clj 229]
  [clojure.java.io$fn__9102$G__9095__9109 invoke io.clj 69]
  [clojure.java.io$fn__9201 invoke io.clj 258]
  [clojure.java.io$fn__9102$G__9095__9109 invoke io.clj 69]
  [clojure.java.io$fn__9163 invoke io.clj 165]
  [clojure.java.io$fn__9115$G__9091__9122 invoke io.clj 69]
  [clojure.java.io$reader doInvoke io.clj 102]
  [clojure.lang.RestFn invoke RestFn.java 410]
  [clojure.lang.AFn applyToHelper AFn.java 154]
  [clojure.lang.RestFn applyTo RestFn.java 132]
  [clojure.core$apply invoke core.clj 632]
  [clojure.core$slurp doInvoke core.clj 6653]
  [clojure.lang.RestFn invoke RestFn.java 410]
  [org.apache.storm.config$get_worker_user invoke config.clj 239]
  [org.apache.storm.daemon.supervisor$shutdown_worker invoke supervisor.clj 281]
  [org.apache.storm.daemon.supervisor$kill_existing_workers_with_change_in_components invoke supervisor.clj 536]
  [org.apache.storm.daemon.supervisor$mk_synchronize_supervisor$this__9078 invoke supervisor.clj 595]
  [org.apache.storm.event$event_manager$fn__8630 invoke event.clj 40]
  [clojure.lang.AFn run AFn.java 22]
  [java.lang.Thread run Thread.java 745]]}
2016-06-15 14:18:58.362 o.a.s.config [INFO] REMOVE worker-user
2016-06-15 14:18:58.362 o.a.s.d.supervisor [INFO] Shut down 
0274fa9c-c271-4a76-928e-28955db4ee34:
{quote}
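
To illustrate what the log above seems to show: if the worker-user path is formed by joining the workers-users directory with the worker ID (an assumption based on the paths in these logs, not a quote of Storm's code), an empty ID resolves to the directory itself, and opening it for reading fails with a FileNotFoundException. Below is a minimal Java sketch of that failure mode plus one possible guard. The names `WorkerUserLookup`, `readWorkerUser` and `readWorkerUserGuarded` are hypothetical and only serve the illustration.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Optional;

public class WorkerUserLookup {

    // Hypothetical stand-in for the lookup: read the first line of <root>/<workerId>.
    // With an empty workerId the path collapses to <root>, which is a directory.
    static String readWorkerUser(File root, String workerId) throws IOException {
        File f = new File(root.getPath() + File.separator + workerId);
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(new FileInputStream(f), StandardCharsets.UTF_8))) {
            return r.readLine();
        }
    }

    // Guarded variant (an assumption about a sensible policy, not Storm's behavior):
    // treat a blank ID or a missing file as "no user recorded" instead of throwing.
    static Optional<String> readWorkerUserGuarded(File root, String workerId) throws IOException {
        if (workerId == null || workerId.trim().isEmpty()) {
            return Optional.empty();
        }
        if (!new File(root, workerId).isFile()) {
            return Optional.empty();
        }
        return Optional.ofNullable(readWorkerUser(root, workerId));
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory plays the role of storm-local/workers-users.
        File root = Files.createTempDirectory("workers-users").toFile();
        try {
            readWorkerUser(root, "");
        } catch (FileNotFoundException e) {
            // The exact message text is platform dependent.
            System.out.println("empty worker ID: " + e.getMessage());
        }
        System.out.println("guarded lookup: " + readWorkerUserGuarded(root, ""));
    }
}
```

The guard only hides the symptom. The open question is still why shutdown is being requested with an empty worker ID in the first place.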

On another supervisor it's a different error:

{quote}
2016-06-15 14:17:45.472 o.a.s.d.supervisor [INFO] Worker 
4cee7f5c-2c87-48af-a44d-c1568e472f12 failed to start
2016-06-15 14:17:45.473 o.a.s.d.supervisor [INFO] Worker 
96f8b5d5-a4e0-4fe2-90a8-7de85a5dd993 failed to start
2016-06-15 14:17:45.504 o.a.s.d.supervisor [INFO] Shutting down and clearing 
state for id 414f723a-683c-4fd8-9b57-b8742a2ddade. Current supervisor time: 
1466000265. State: :disallowed, Heartbeat: {:time-secs 1466000264, :storm-id 
"<snipped>", :executors [[12 12] [54 54] [156 156] [42 42] [72 72] [24 24] [144 
144] [162 162] [186 186] [66 66] [120 120] [102 102] [18 18] [6 6] [96 96] [150 
150] [48 48] [30 30] [-1 -1] [114 114] [84 84] [90 90] [60 60] [126 126] [138 
138] [36 36] [108 108] [180 180] [132 132] [168 168] [78 78] [174 174]], :port 
6701}
2016-06-15 14:17:45.504 o.a.s.d.supervisor [INFO] Shutting down 
0274fa9c-c271-4a76-928e-28955db4ee34:414f723a-683c-4fd8-9b57-b8742a2ddade
2016-06-15 14:17:45.505 o.a.s.config [INFO] GET worker-user 
414f723a-683c-4fd8-9b57-b8742a2ddade
2016-06-15 14:17:45.507 o.a.s.config [WARN] Failed to get worker user for 
414f723a-683c-4fd8-9b57-b8742a2ddade. #error {
 :cause 
/var/storm/storm-local/workers-users/414f723a-683c-4fd8-9b57-b8742a2ddade (No 
such file or directory)
 :via
 [{:type java.io.FileNotFoundException
   :message 
/var/storm/storm-local/workers-users/414f723a-683c-4fd8-9b57-b8742a2ddade (No 
such file or directory)
   :at [java.io.FileInputStream open0 FileInputStream.java -2]}]
 :trace
 [[java.io.FileInputStream open0 FileInputStream.java -2]
  [java.io.FileInputStream open FileInputStream.java 195]
  [java.io.FileInputStream <init> FileInputStream.java 138]
  [clojure.java.io$fn__9189 invoke io.clj 229]
  [clojure.java.io$fn__9102$G__9095__9109 invoke io.clj 69]
  [clojure.java.io$fn__9201 invoke io.clj 258]
  [clojure.java.io$fn__9102$G__9095__9109 invoke io.clj 69]
  [clojure.java.io$fn__9163 invoke io.clj 165]
  [clojure.java.io$fn__9115$G__9091__9122 invoke io.clj 69]
  [clojure.java.io$reader doInvoke io.clj 102]
  [clojure.lang.RestFn invoke RestFn.java 410]
  [clojure.lang.AFn applyToHelper AFn.java 154]
  [clojure.lang.RestFn applyTo RestFn.java 132]
  [clojure.core$apply invoke core.clj 632]
  [clojure.core$slurp doInvoke core.clj 6653]
  [clojure.lang.RestFn invoke RestFn.java 410]
  [org.apache.storm.config$get_worker_user invoke config.clj 239]
  [org.apache.storm.daemon.supervisor$shutdown_worker invoke supervisor.clj 281]
  [org.apache.storm.daemon.supervisor$sync_processes invoke supervisor.clj 427]
  [clojure.core$partial$fn__4527 invoke core.clj 2492]
  [org.apache.storm.event$event_manager$fn__8630 invoke event.clj 40]
  [clojure.lang.AFn run AFn.java 22]
  [java.lang.Thread run Thread.java 745]]}
2016-06-15 14:17:45.517 o.a.s.config [INFO] REMOVE worker-user 
414f723a-683c-4fd8-9b57-b8742a2ddade
2016-06-15 14:17:45.517 o.a.s.d.supervisor [INFO] Shut down 
0274fa9c-c271-4a76-928e-28955db4ee34:414f723a-683c-4fd8-9b57-b8742a2ddade
2016-06-15 14:17:45.518 o.a.s.d.supervisor [INFO] Shutting down and clearing 
state for id 2eb07a90-a8b7-4a81-929c-93d6edb3c2ef. Current supervisor time: 
1466000265. State: :disallowed, Heartbeat: {:time-secs 1466000265, :storm-id 
"<snipped>", :executors [[41 41] [125 125] [137 137] [53 53] [65 65] [101 101] 
[149 149] [161 161] [-1 -1] [5 5] [29 29] [89 89] [173 173] [77 77] [185 185] 
[113 113] [17 17]], :port 6700}
2016-06-15 14:17:45.518 o.a.s.d.supervisor [INFO] Shutting down 
0274fa9c-c271-4a76-928e-28955db4ee34:2eb07a90-a8b7-4a81-929c-93d6edb3c2ef
2016-06-15 14:17:45.518 o.a.s.config [INFO] GET worker-user 
2eb07a90-a8b7-4a81-929c-93d6edb3c2ef
2016-06-15 14:17:45.519 o.a.s.config [WARN] Failed to get worker user for 
2eb07a90-a8b7-4a81-929c-93d6edb3c2ef. #error {
 :cause 
/var/storm/storm-local/workers-users/2eb07a90-a8b7-4a81-929c-93d6edb3c2ef (No 
such file or directory)
 :via
 [{:type java.io.FileNotFoundException
   :message 
/var/storm/storm-local/workers-users/2eb07a90-a8b7-4a81-929c-93d6edb3c2ef (No 
such file or directory)
   :at [java.io.FileInputStream open0 FileInputStream.java -2]}]
 :trace
 [[java.io.FileInputStream open0 FileInputStream.java -2]
  [java.io.FileInputStream open FileInputStream.java 195]
  [java.io.FileInputStream <init> FileInputStream.java 138]
  [clojure.java.io$fn__9189 invoke io.clj 229]
  [clojure.java.io$fn__9102$G__9095__9109 invoke io.clj 69]
  [clojure.java.io$fn__9201 invoke io.clj 258]
  [clojure.java.io$fn__9102$G__9095__9109 invoke io.clj 69]
  [clojure.java.io$fn__9163 invoke io.clj 165]
  [clojure.java.io$fn__9115$G__9091__9122 invoke io.clj 69]
  [clojure.java.io$reader doInvoke io.clj 102]
  [clojure.lang.RestFn invoke RestFn.java 410]
  [clojure.lang.AFn applyToHelper AFn.java 154]
  [clojure.lang.RestFn applyTo RestFn.java 132]
  [clojure.core$apply invoke core.clj 632]
  [clojure.core$slurp doInvoke core.clj 6653]
  [clojure.lang.RestFn invoke RestFn.java 410]
  [org.apache.storm.config$get_worker_user invoke config.clj 239]
  [org.apache.storm.daemon.supervisor$shutdown_worker invoke supervisor.clj 281]
  [org.apache.storm.daemon.supervisor$sync_processes invoke supervisor.clj 427]
  [clojure.core$partial$fn__4527 invoke core.clj 2492]
  [org.apache.storm.event$event_manager$fn__8630 invoke event.clj 40]
  [clojure.lang.AFn run AFn.java 22]
  [java.lang.Thread run Thread.java 745]]}
2016-06-15 14:17:45.525 o.a.s.config [INFO] REMOVE worker-user 
2eb07a90-a8b7-4a81-929c-93d6edb3c2ef
2016-06-15 14:17:45.525 o.a.s.d.supervisor [INFO] Shut down 
0274fa9c-c271-4a76-928e-28955db4ee34:2eb07a90-a8b7-4a81-929c-93d6edb3c2ef
2016-06-15 14:17:45.526 o.a.s.d.supervisor [INFO] Shutting down and clearing 
state for id 96f8b5d5-a4e0-4fe2-90a8-7de85a5dd993. Current supervisor time: 
1466000265. State: :not-started, Heartbeat: nil
2016-06-15 14:17:45.526 o.a.s.d.supervisor [INFO] Shutting down 
0274fa9c-c271-4a76-928e-28955db4ee34:96f8b5d5-a4e0-4fe2-90a8-7de85a5dd993
2016-06-15 14:17:45.526 o.a.s.config [INFO] GET worker-user 
96f8b5d5-a4e0-4fe2-90a8-7de85a5dd993
2016-06-15 14:17:45.527 o.a.s.config [INFO] REMOVE worker-user 
96f8b5d5-a4e0-4fe2-90a8-7de85a5dd993
2016-06-15 14:17:45.527 o.a.s.d.supervisor [INFO] Shut down 
0274fa9c-c271-4a76-928e-28955db4ee34:96f8b5d5-a4e0-4fe2-90a8-7de85a5dd993
{quote}

[~kabhwan] have you seen this before?

> Supervisor may not shut down workers cleanly
> --------------------------------------------
>
>                 Key: STORM-1879
>                 URL: https://issues.apache.org/jira/browse/STORM-1879
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>    Affects Versions: 1.0.1
>            Reporter: Stig Rohde Døssing
>         Attachments: fix_missing_worker_pid.patch, nimbus-supervisor.zip, 
> supervisor.log
>
>
> We've run into a strange issue with a zombie worker process. It looks like 
> the worker pid file somehow got deleted without the worker process shutting 
> down. This causes the supervisor to try repeatedly to kill the worker 
> unsuccessfully, and means multiple workers may be assigned to the same port. 
> The worker root folder sticks around because the worker is still heartbeating 
> to it.
> It may or may not be related that we've seen Nimbus occasionally enter an 
> infinite loop of printing logs similar to the below.
> {code}
> 2016-05-19 14:55:14.196 o.a.s.b.BlobStoreUtils [ERROR] Could not update the 
> blob with keyZendeskTicketTopology-5-1463647641-stormconf.ser
> 2016-05-19 14:55:14.210 o.a.s.b.BlobStoreUtils [ERROR] Could not update the 
> blob with keyZendeskTicketTopology-5-1463647641-stormcode.ser
> 2016-05-19 14:55:14.218 o.a.s.b.BlobStoreUtils [ERROR] Could not update the 
> blob with keyZendeskTicketTopology-5-1463647641-stormconf.ser
> 2016-05-19 14:55:14.256 o.a.s.b.BlobStoreUtils [ERROR] Could not update the 
> blob with keyZendeskTicketTopology-5-1463647641-stormcode.ser
> 2016-05-19 14:55:14.273 o.a.s.b.BlobStoreUtils [ERROR] Could not update the 
> blob with keyZendeskTicketTopology-5-1463647641-stormcode.ser
> 2016-05-19 14:55:14.316 o.a.s.b.BlobStoreUtils [ERROR] Could not update the 
> blob with keyZendeskTicketTopology-5-1463647641-stormconf.ser
> {code}
> Which continues until Nimbus is rebooted. We also see repeating blocks 
> similar to the logs below.
> {code}
> 2016-06-02 07:45:03.656 o.a.s.d.nimbus [INFO] Cleaning up 
> ZendeskTicketTopology-127-1464780171
> 2016-06-02 07:45:04.132 o.a.s.d.nimbus [INFO] 
> ExceptionKeyNotFoundException(msg:ZendeskTicketTopology-127-1464780171-stormjar.jar)
> 2016-06-02 07:45:04.144 o.a.s.d.nimbus [INFO] 
> ExceptionKeyNotFoundException(msg:ZendeskTicketTopology-127-1464780171-stormconf.ser)
> 2016-06-02 07:45:04.155 o.a.s.d.nimbus [INFO] 
> ExceptionKeyNotFoundException(msg:ZendeskTicketTopology-127-1464780171-stormcode.ser)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
