[ 
https://issues.apache.org/jira/browse/STORM-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959017#comment-16959017
 ] 

Aaron Gresch commented on STORM-3523:
-------------------------------------

It turns out that handling this exception exposed the root cause for us.

When the exception was left unhandled, the supervisor would restart and fix itself.

Once we handled the exception, we would sometimes see workers fail to start.
Investigating further, we found deadlocks in the AsyncLocalizer caused by running
with only 3 threads.  This was just a setting we had carried forward from our older
pre-2.x clusters.
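
For readers unfamiliar with this failure mode, here is a minimal sketch of how a small fixed thread pool can deadlock on itself. It is not the AsyncLocalizer code, and it assumes (the comment above does not spell this out) that the deadlock comes from tasks blocking on further work submitted to the same pool. The class name PoolStarvationSketch, the method completes, and the "blob" payload are hypothetical and exist only for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not taken from Apache Storm.
public class PoolStarvationSketch {

    /**
     * Submits `parents` tasks to a fixed pool of `poolSize` threads. Each
     * parent submits a child task to the SAME pool and blocks on its result.
     * Returns true if every parent finishes within the timeout.
     */
    public static boolean completes(int poolSize, int parents, long timeoutMs) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        // Hold parents until as many as can run at once are running, so the
        // outcome does not depend on thread scheduling.
        CountDownLatch started = new CountDownLatch(Math.min(poolSize, parents));
        CountDownLatch finished = new CountDownLatch(parents);
        try {
            for (int i = 0; i < parents; i++) {
                pool.submit(() -> {
                    started.countDown();
                    started.await();
                    // The child needs a free thread from the same pool.
                    Future<String> child = pool.submit(() -> "blob");
                    child.get(); // blocks forever if no thread is free
                    finished.countDown();
                    return null;
                });
            }
            return finished.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            // Interrupt any parents still stuck waiting on their children.
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // One spare thread is left for the children, so this completes.
        System.out.println("3 threads, 2 parents: " + completes(3, 2, 2000));
        // Every thread is held by a waiting parent, so this times out.
        System.out.println("3 threads, 3 parents: " + completes(3, 3, 500));
    }
}
```

In this sketch the pool stalls as soon as the number of blocked parents reaches the pool size, which is why a very small thread count makes this kind of deadlock much easier to hit.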

I think it would be best to revert this change and add a comment regarding 
deadlock possibilities.

It also appears that running the localizer this way was the actual root cause
of STORM-3168.  However, the change I made there still seems valid.

> supervisor restarts when releasing slot with missing file
> ---------------------------------------------------------
>
>                 Key: STORM-3523
>                 URL: https://issues.apache.org/jira/browse/STORM-3523
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Aaron Gresch
>            Assignee: Aaron Gresch
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 2.2.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> {code:java}
> 2019-10-03 16:25:32.809 o.a.s.d.s.Slot SLOT_6719 [ERROR] Error when processing event
> java.io.FileNotFoundException: File 'x/storm/supervisor/stormdist/xxx-190213-004131-001-209-1550018519/stormconf.ser' does not exist
>         at org.apache.storm.shade.org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:297) ~[shaded-deps-2.0.1.y.jar:2.0.1.y]
>         at org.apache.storm.shade.org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1851) ~[shaded-deps-2.0.1.y.jar:2.0.1.y]
>         at org.apache.storm.utils.ConfigUtils.readSupervisorStormConfGivenPath(ConfigUtils.java:308) ~[storm-client-2.0.1.y.jar:2.0.1.y]
>         at org.apache.storm.utils.ConfigUtils.readSupervisorStormConfImpl(ConfigUtils.java:469) ~[storm-client-2.0.1.y.jar:2.0.1.y]
>         at org.apache.storm.utils.ConfigUtils.readSupervisorStormConf(ConfigUtils.java:303) ~[storm-client-2.0.1.y.jar:2.0.1.y]
>         at org.apache.storm.localizer.AsyncLocalizer.getLocalResources(AsyncLocalizer.java:359) ~[storm-server-2.0.1.y.jar:2.0.1.y]
>         at org.apache.storm.localizer.AsyncLocalizer.releaseSlotFor(AsyncLocalizer.java:460) ~[storm-server-2.0.1.y.jar:2.0.1.y]
>         at org.apache.storm.daemon.supervisor.Slot.handleWaitingForBlobLocalization(Slot.java:435) ~[storm-server-2.0.1.y.jar:2.0.1.y]
>         at org.apache.storm.daemon.supervisor.Slot.stateMachineStep(Slot.java:229) ~[storm-server-2.0.1.y.jar:2.0.1.y]
>         at org.apache.storm.daemon.supervisor.Slot.run(Slot.java:900) [storm-server-2.0.1.y.jar:2.0.1.y]
> 2019-10-03 16:25:32.810 o.a.s.u.Utils SLOT_6719 [ERROR] Halting process: Error when processing an event
> java.lang.RuntimeException: Halting process: Error when processing an event
>         at org.apache.storm.utils.Utils.exitProcess(Utils.java:550) [storm-client-2.0.1.y.jar:2.0.1.y]
>         at org.apache.storm.daemon.supervisor.Slot.run(Slot.java:947) [storm-server-2.0.1.y.jar:2.0.1.y]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
