[ 
https://issues.apache.org/jira/browse/HELIX-535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joy updated HELIX-535:
----------------------
    Attachment: xaf
                xae
                xad
                xac
                xab
                xaa

The controller log is split into 6 files to work around the size limit.

> Helix controller stops working with heavy configuration
> -------------------------------------------------------
>
>                 Key: HELIX-535
>                 URL: https://issues.apache.org/jira/browse/HELIX-535
>             Project: Apache Helix
>          Issue Type: Bug
>          Components: helix-core
>         Environment: machine:$ uname -a
> Linux eat1-app373.stg 2.6.32-220.10.1.el6.x86_64 #1 SMP Fri Mar 9 12:37:51 
> EST 2012 x86_64 x86_64 x86_64 GNU/Linux
> JVM version: $ /export/apps/jdk/current/bin/java -version
> java version "1.6.0_21"
> Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
> Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)
>            Reporter: Joy
>         Attachments: xaa, xab, xac, xad, xae, xaf
>
>
> The issue consistently occurs under a heavy configuration: a larger number 
> of znodes, partitions, and databases.
> The goal of our tests is to evaluate the performance of the Helix controller 
> (in terms of controller latency) as the number of nodes, databases, and 
> partitions increases.
> In our test, we use multiple machines: one for ZooKeeper, one for the Helix 
> controller, and the rest for dummy processes. The configuration is as 
> follows:
>         zkr <----------> helix
>          ^
>          |
>          v
>       dummy processes
> We intentionally kill the master dummy processes once every 30 seconds to 
> simulate a failure event. Everything works fine with a light configuration, 
> such as 27 nodes + 1 db + 729 partitions. However, when the configuration is 
> heavy, such as 81 nodes + 10 databases + 81 partitions for each db, the 
> controller latency increases significantly after several failure events:
>                  Controller latency (ms)
> First event:     182
> Second event:    188
> Third event:     200
> Fourth event:    193
> Fifth event:     200
> Sixth event:     185
> Seventh event:   189
> Eighth event:    213
> Ninth event:     1082209
> After this extremely long failure event, the Helix controller stops 
> working. The controller log is attached.
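
The size of the jump at the ninth event can be quantified with a short script. This is only a rough sketch: the function name find_latency_outliers and the threshold factor of 10 are assumptions made for illustration, and are not part of Helix or of the test harness described above. The latency values are the ones reported in the table.

```python
from statistics import median

# Controller latencies (ms) per failure event, copied from the report.
latencies_ms = [182, 188, 200, 193, 200, 185, 189, 213, 1082209]


def find_latency_outliers(samples, factor=10):
    """Return the baseline median and the (event_number, latency) pairs
    whose latency exceeds factor * baseline.

    The median is used as the baseline because a single extreme value
    barely moves it. The factor of 10 is an arbitrary, assumed threshold.
    """
    baseline = median(samples)
    flagged = [
        (event, value)
        for event, value in enumerate(samples, start=1)
        if value > factor * baseline
    ]
    return baseline, flagged


baseline, outliers = find_latency_outliers(latencies_ms)
print(f"baseline median: {baseline} ms")
for event, value in outliers:
    print(f"event {event}: {value} ms ({value / baseline:.0f}x baseline)")
```

With the reported numbers, only the ninth event is flagged, which matches the point at which the controller stops working.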



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
