Joy created HELIX-535:
-------------------------
Summary: Helix controller stops working with heavy configuration
Key: HELIX-535
URL: https://issues.apache.org/jira/browse/HELIX-535
Project: Apache Helix
Issue Type: Bug
Components: helix-core
Environment: machine:$ uname -a
Linux eat1-app373.stg 2.6.32-220.10.1.el6.x86_64 #1 SMP Fri Mar 9 12:37:51 EST
2012 x86_64 x86_64 x86_64 GNU/Linux
JVM version: $ /export/apps/jdk/current/bin/java -version
java version "1.6.0_21"
Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)
Reporter: Joy
The issue occurs consistently under a heavy configuration: a larger number of
znodes, partitions, and databases.
The goal of our tests is to evaluate the performance of the Helix controller (in
terms of controller latency) as the number of nodes, databases, and partitions
increases.
In our test, we use multiple machines: one for ZooKeeper, one for the Helix
controller, and the rest for dummy processes. The setup is as follows:
zkr <----------> helix
^
|
V
dummy processes
We intentionally kill the master dummy processes once every 30 seconds to
simulate a failure event. Everything works fine with a light configuration, such
as 27 nodes + 1 database + 729 partitions. However, with a heavy configuration,
such as 81 nodes + 10 databases + 81 partitions per database, the controller
latency increases significantly after several failure events:
Controller latency (ms)
First event: 182
Second event: 188
Third event: 200
Fourth event: 193
Fifth event: 200
Sixth event: 185
Seventh event: 189
Eighth event: 213
Ninth event: 1082209
After this extremely long failure event, the Helix controller stops working.
The controller log is attached.
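
As a rough illustration of how far the ninth event departs from the others, the
following Python sketch flags any event whose latency exceeds a multiple of the
median. The latency values are copied from the list above; the function name
find_outliers and the 10x threshold are assumptions chosen for illustration and
are not part of Helix or of the test harness.

```python
from statistics import median

# Controller latencies (ms) per failure event, copied from the report.
latencies_ms = [182, 188, 200, 193, 200, 185, 189, 213, 1082209]


def find_outliers(samples, factor=10):
    """Return (event_number, value) pairs above factor x the median.

    The factor of 10 is an arbitrary, illustrative threshold.
    """
    baseline = median(samples)
    return [(i + 1, v) for i, v in enumerate(samples) if v > factor * baseline]


baseline_ms = median(latencies_ms)
outliers = find_outliers(latencies_ms)

for event, value in outliers:
    minutes = value / 60000  # milliseconds to minutes
    ratio = value / baseline_ms
    print(f"event {event}: {value} ms (~{minutes:.1f} min), {ratio:.0f}x median")
```

With these numbers only the ninth event is flagged: 1082209 ms is roughly 18
minutes, against a median of 193 ms for the run.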