[
https://issues.apache.org/jira/browse/STORM-109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rick Kellogg updated STORM-109:
-------------------------------
Component/s: storm-core
> Deploying topology with 540 workers caused nimbus to crash.
> -----------------------------------------------------------
>
> Key: STORM-109
> URL: https://issues.apache.org/jira/browse/STORM-109
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-core
> Reporter: James Xu
> Priority: Minor
>
> https://github.com/nathanmarz/storm/issues/604
> When deploying a topology to a Storm cluster and requesting 540 workers,
> nimbus entered a continuous exception/die/restart loop, printing this
> in the logs:
> 2013-06-21 02:14:23,551 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] -
> Exception causing close of session 0x13d7f3c28867c9f due to
> java.io.IOException: Len error 1277489
> When the topology was killed by nuking the local disk state for that
> topology, nimbus recovered. When the topology was redeployed with fewer
> workers, it did not cause nimbus to fail.
> What is probably happening is that nimbus is trying to create a znode
> larger than ZooKeeper's default maximum node size of roughly 1 MB; the
> logged length of 1277489 bytes is just over that limit.
> One solution is to raise this threshold in ZooKeeper to a value larger
> than 1 MB.
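> A minimal sketch of the server-side half, assuming a stock zkServer.sh
> deployment; the 4 MB value here is illustrative, not something this
> issue prescribes:
>
>   # conf/java.env (or the distro's zookeeper-env.sh), read at startup.
>   # jute.maxbuffer caps the serialized size of znode reads and writes,
>   # so it must exceed the largest assignments node the cluster produces.
>   SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Djute.maxbuffer=4194304"
>
> The same property also has to reach every ZooKeeper client JVM, which is
> what the comment below gets at.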
> ---------
> d2r: This is exactly what we had to do. We did it by setting the
> jute.maxbuffer property to some higher power of 2.
> I think it happened to us because we launched > 12k workers and all of
> the assignments were serialized and written to a single zk node at once,
> which constantly exceeded the 1MB buffer. I think we ended up using
> jute.maxbuffer=4097150.
> EDIT: Should also note that workers/supervisors read this zk node after
> it is written, and the same buffer limit applies on reads. So the
> childopts for supervisors and workers need to carry the same setting, in
> addition to nimbus's (see the sketch below).
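> A minimal storm.yaml sketch of that client-side half. The value and the
> existing contents of each childopts setting are assumptions; append the
> flag to whatever options are already configured:
>
>   # storm.yaml -- every Storm JVM that talks to ZooKeeper needs the
>   # raised buffer, or reads of the oversized assignments node fail with
>   # the same "Len error" seen above.
>   nimbus.childopts: "-Djute.maxbuffer=4194304"
>   supervisor.childopts: "-Djute.maxbuffer=4194304"
>   worker.childopts: "-Djute.maxbuffer=4194304"
>
> Keeping one value across the ZooKeeper servers, nimbus, supervisors, and
> workers avoids a mixed setup where writes succeed but reads fail.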
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)