[
https://issues.apache.org/jira/browse/STORM-109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rick Kellogg updated STORM-109:
-------------------------------
Component/s: storm-core
> Deploying topology with 540 workers caused nimbus to crash.
> -----------------------------------------------------------
>
> Key: STORM-109
> URL: https://issues.apache.org/jira/browse/STORM-109
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-core
> Reporter: James Xu
> Priority: Minor
>
> https://github.com/nathanmarz/storm/issues/604
> When deploying a topology to a Storm cluster and requesting 540 workers,
> nimbus entered a continuous exception/die/restart loop, printing this
> in the logs:
> 2013-06-21 02:14:23,551 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] -
> Exception causing close of session 0x13d7f3c28867c9f due to
> java.io.IOException: Len error 1277489
> When the topology was killed by nuking the local disk state for that
> topology, nimbus recovered. When the topology was redeployed with fewer
> workers, it did not cause nimbus to fail.
> What is probably happening is that nimbus is trying to create a znode
> larger than ZooKeeper's default maximum node size of roughly 1 MB; the
> logged length of 1277489 bytes is just over that limit.
> One solution is to raise this threshold in ZooKeeper to a value larger
> than 1 MB.
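> A minimal sketch of the server-side half, assuming a stock zkServer.sh
> deployment; the 4 MB value here is illustrative, not something this
> issue prescribes:
>
>   # conf/java.env (or the distro's zookeeper-env.sh), read at startup.
>   # jute.maxbuffer caps the serialized size of znode reads and writes,
>   # so it must exceed the largest assignments node the cluster produces.
>   SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Djute.maxbuffer=4194304"
>
> The same property also has to reach every ZooKeeper client JVM, which is
> what the comment below gets at.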
> ---------
> d2r: This is exactly what we had to do. We did it by setting the
> jute.maxbuffer property to some higher power of 2.
> I think it happened to us because we launched > 12k workers and all of
> the assignments were serialized and written to a single zk node at once,
> which constantly exceeded the 1MB buffer. I think we ended up using
> jute.maxbuffer=4097150.
> EDIT: Should also note that workers/supervisors read this zk node after
> it is written, and the same buffer limit applies on reads. So the
> childopts for supervisors and workers need to carry the same setting, in
> addition to nimbus's (see the sketch below).
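> A minimal storm.yaml sketch of that client-side half. The value and the
> existing contents of each childopts setting are assumptions; append the
> flag to whatever options are already configured:
>
>   # storm.yaml -- every Storm JVM that talks to ZooKeeper needs the
>   # raised buffer, or reads of the oversized assignments node fail with
>   # the same "Len error" seen above.
>   nimbus.childopts: "-Djute.maxbuffer=4194304"
>   supervisor.childopts: "-Djute.maxbuffer=4194304"
>   worker.childopts: "-Djute.maxbuffer=4194304"
>
> Keeping one value across the ZooKeeper servers, nimbus, supervisors, and
> workers avoids a mixed setup where writes succeed but reads fail.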
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)