[jira] [Created] (HELIX-584) SimpleDateFormat should not be used as singleton due to its race conditions

2015-03-20 Thread Lei Xia (JIRA)
Lei Xia created HELIX-584:
-

 Summary: SimpleDateFormat should not be used as singleton due to 
its race conditions
 Key: HELIX-584
 URL: https://issues.apache.org/jira/browse/HELIX-584
 Project: Apache Helix
  Issue Type: Bug
Reporter: Lei Xia


SimpleDateFormat is used in WorkflowConfig as a singleton. But since it is not 
thread-safe (refer here: 
http://www.hpenterprisesecurity.com/vulncat/en/vulncat/java/race_condition_format_flaw.html),
 it will sometimes mess up the output date format due to race conditions. 
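
A common fix is to stop sharing the formatter across threads, e.g. via a 
ThreadLocal. A minimal sketch, assuming a UTC formatter; the class name and 
pattern string are illustrative, not the actual WorkflowConfig code:

{code}
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DateFormatHolder {
  // Each thread gets its own SimpleDateFormat, so there is no shared mutable state.
  private static final ThreadLocal<SimpleDateFormat> DATE_FORMAT =
      new ThreadLocal<SimpleDateFormat>() {
        @Override
        protected SimpleDateFormat initialValue() {
          SimpleDateFormat format = new SimpleDateFormat("MM-dd-yyyy HH:mm:ss");
          format.setTimeZone(TimeZone.getTimeZone("UTC"));
          return format;
        }
      };

  public static String format(Date date) {
    return DATE_FORMAT.get().format(date);
  }
}
{code}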


An example stack trace for such a failure:

Message:
For input string: 2003.E2003E22

Full Stacktrace:
java.lang.NumberFormatException: For input string: 2003.E2003E22
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1222)
at java.lang.Double.parseDouble(Double.java:510)
at java.text.DigitList.getDouble(DigitList.java:151)
at java.text.DecimalFormat.parse(DecimalFormat.java:1302)
at java.text.SimpleDateFormat.subParse(SimpleDateFormat.java:1935)
at java.text.SimpleDateFormat.parse(SimpleDateFormat.java:1311)
at java.text.DateFormat.parse(DateFormat.java:335)
at org.apache.helix.task.TaskUtil.parseScheduleFromConfigMap(TaskUtil.java:365)
at org.apache.helix.task.WorkflowConfig$Builder.fromMap(WorkflowConfig.java:173)
at org.apache.helix.task.TaskUtil.getWorkflowCfg(TaskUtil.java:113)
at org.apache.helix.task.TaskUtil.getWorkflowCfg(TaskUtil.java:126)
at org.apache.helix.integration.task.TestUtil.pollForJobState(TestUtil.java:61)
at 
org.apache.helix.integration.task.TestTaskRebalancerStopResume.stopDeleteJobAndResumeRecurrentQueue(TestTaskRebalancerStopResume.java:420)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:76)
at org.testng.internal.Invoker.invokeMethod(Invoker.java:673)
at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:846)
at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1170)
at 
org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:125)
at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:109)
at org.testng.TestRunner.runWorkers(TestRunner.java:1147)
at org.testng.TestRunner.privateRun(TestRunner.java:749)
at org.testng.TestRunner.run(TestRunner.java:600)
at org.testng.SuiteRunner.runTest(SuiteRunner.java:317)
at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:312)
at org.testng.SuiteRunner.privateRun(SuiteRunner.java:274)
at org.testng.SuiteRunner.run(SuiteRunner.java:223)
at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86)
at org.testng.TestNG.runSuitesSequentially(TestNG.java:1039)
at org.testng.TestNG.runSuitesLocally(TestNG.java:964)
at org.testng.TestNG.run(TestNG.java:900)
at org.apache.maven.surefire.testng.TestNGExecutor.run(TestNGExecutor.java:178)
at 
org.apache.maven.surefire.testng.TestNGXmlTestSuite.execute(TestNGXmlTestSuite.java:92)
at 
org.apache.maven.surefire.testng.TestNGProvider.invoke(TestNGProvider.java:96)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray2(ReflectionUtils.java:208)
at 
org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:158)
at 
org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:86)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:95)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HELIX-586) Race condition could happen when invokeRebalance() in TaskDriver try to touch IdealState

2015-03-20 Thread Lei Xia (JIRA)
Lei Xia created HELIX-586:
-

 Summary: Race condition could happen when invokeRebalance() in 
TaskDriver try to touch IdealState
 Key: HELIX-586
 URL: https://issues.apache.org/jira/browse/HELIX-586
 Project: Apache Helix
  Issue Type: Bug
Reporter: Lei Xia


invokeRebalance() in TaskDriver is a hack that touches the IdealState, kept 
until the bug where resource config changes do not trigger a rebalance is fixed.

There is a possible race condition here: if the IdealState has already been 
deleted by the task rebalancer, touching it will create a null/empty znode.
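
One way to narrow the window is to check that the IdealState still exists 
before touching it. A sketch under assumed variable names (manager, 
resourceName), not the actual TaskDriver code; note a small read-then-write 
race still remains:

{code}
HelixDataAccessor accessor = manager.getHelixDataAccessor();
PropertyKey key = accessor.keyBuilder().idealStates(resourceName);
IdealState idealState = accessor.getProperty(key);
if (idealState != null) {
  // Touch the existing IdealState to trigger a rebalance; skip if it was already deleted.
  accessor.updateProperty(key, idealState);
}
{code}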



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HELIX-599) Support creating/maintaining/routing resources with same names in different instance groups

2015-06-04 Thread Lei Xia (JIRA)
Lei Xia created HELIX-599:
-

 Summary: Support creating/maintaining/routing resources with same 
names in different instance groups
 Key: HELIX-599
 URL: https://issues.apache.org/jira/browse/HELIX-599
 Project: Apache Helix
  Issue Type: New Feature
  Components: helix-core, helix-webapp-admin
Reporter: Lei Xia
Assignee: Lei Xia


At LinkedIn, we have a new usage scenario in which multiple databases with the 
same name sit in the same Helix cluster, but on different instance groups.  
What we need is:

1) Allow resources (databases) with the same name, where these resources 
reside on different instance groups (with different tags).

2) The routing table (Spectator) is able to aggregate and return all instances 
(from multiple instance groups) that hold the database with a given name.

Our proposed solution is:

1) Add a Resource Group field in IdealState for databases with the same name 
from different instance groups.

2) Use an Instance Group Tag (or a new Resource Tag) to differentiate databases 
(with the same name) from different instance groups.

3) Use name mangling for the IdealState; for example, for database TestDB in 
instance group testGroup, the IdealState and ExternalView id would be 
TestDB$testGroup. 

4) Change the Helix routing table to be able to aggregate databases from the 
same resource group.
 

Four new APIs are going to be added to RoutingTableProvider:

public class RoutingTableProvider {
 
/**
 * returns the instances that contain the given partition in a specific state
 * from all resources with given resource name
 */
public List<InstanceConfig> getInstances(String resource, String partition, 
String state);
 
/**
 * returns the instances that contain the given partition in a specific state
 * from selected resources with given name and tags
 */
public List<InstanceConfig> getInstances(String resource, String partition, 
String state, List<String> resourceTags);
 
/**
 * returns instances that contain given resource that are in a specific state
 */
public Set<InstanceConfig> getInstances(String resource, String state);
 
/**
 * returns instances that contain given resource with tags that are in a 
 * specific state
 */
public Set<InstanceConfig> getInstances(String resource, String state, 
List<String> groupTags);
}
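
A minimal usage sketch of the proposed APIs; the resource, partition, and tag 
values are illustrative:

{code}
RoutingTableProvider routingTable = new RoutingTableProvider();
// All instances holding partition TestDB_0 of TestDB in ONLINE state, across all groups.
List<InstanceConfig> all = routingTable.getInstances("TestDB", "TestDB_0", "ONLINE");
// Restrict the lookup to the instance group tagged testGroup (mangled id: TestDB$testGroup).
List<InstanceConfig> scoped = routingTable.getInstances("TestDB", "TestDB_0", "ONLINE",
    Collections.singletonList("testGroup"));
{code}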



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HELIX-568) Make Helix rebalancer rackaware

2015-06-08 Thread Lei Xia (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577572#comment-14577572
 ] 

Lei Xia commented on HELIX-568:
---

We are discussing this feature for our LinkedIn internal use cases, and may 
start working on it this quarter.

 Make Helix rebalancer rackaware
 ---

 Key: HELIX-568
 URL: https://issues.apache.org/jira/browse/HELIX-568
 Project: Apache Helix
  Issue Type: Bug
Reporter: Zhen Zhang
Assignee: Lei Xia

 Currently Helix doesn't support rack awareness in the rebalancer. There are 
 scenarios where rack awareness is necessary. We should think about providing 
 primitives and a default rebalancing strategy to support rack awareness.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HELIX-600) Task scheduler fails to schedule a recurring workflow if the startTime is set to a future timestamp

2015-06-09 Thread Lei Xia (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579577#comment-14579577
 ] 

Lei Xia commented on HELIX-600:
---

To be more specific, this only happens when you delete an existing queue and 
recreate it:

- stop the queue
- delete the queue (using the taskDriver.delete(queueName) API)
- create a queue with a new startTime recurrence parameter

If the new startTime is a value in the future (say current time + 5 minutes), 
Helix does not schedule the jobs even after the startTime timestamp elapses.
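
A repro sketch using TaskDriver; the queue name, interval, and the 
schedule/builder method names are assumptions about the 0.6.x task API, not 
verified code:

{code}
TaskDriver driver = new TaskDriver(manager);
driver.stop("myQueue");
driver.delete("myQueue");

// Recreate the queue with a start time 5 minutes in the future.
Date startTime = new Date(System.currentTimeMillis() + 5 * 60 * 1000);
ScheduleConfig schedule = ScheduleConfig.recurringFromDate(startTime, TimeUnit.MINUTES, 10);
JobQueue queue = new JobQueue.Builder("myQueue")
    .setWorkflowConfig(new WorkflowConfig.Builder().setScheduleConfig(schedule).build())
    .build();
driver.createQueue(queue);
// Expected: jobs start once startTime elapses; observed: nothing is scheduled.
{code}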


 Task scheduler fails to schedule a recurring workflow if the startTime is set 
 to a future timestamp
 ---

 Key: HELIX-600
 URL: https://issues.apache.org/jira/browse/HELIX-600
 Project: Apache Helix
  Issue Type: Bug
Affects Versions: 0.6.3, 0.6.4
Reporter: Karthiek
Assignee: Lei Xia

 If we define a recurrent job queue with a start-time value in the future (say 
 current time + 5 minutes), Helix does not schedule the queue even after the 
 start-time timestamp elapses. Helix should schedule jobs once the recurrence 
 timestamp is hit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HELIX-600) Task scheduler fails to schedule a recurring workflow if the startTime is set to a future timestamp

2015-06-09 Thread Lei Xia (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579577#comment-14579577
 ] 

Lei Xia edited comment on HELIX-600 at 6/9/15 11:23 PM:


To be more specific, this only happens when you delete an existing queue and 
recreate it:

- stop the queue
- delete the queue (using the taskDriver.delete(queueName) API)
- create a queue with a new startTime recurrence parameter

If the new startTime is a value in the future (say current time + 5 minutes), 
Helix does not schedule the jobs even after the startTime timestamp elapses.



was (Author: andrewlxia):
To me more specific, this only happens when you delete an existing queue, and 
recreate it.

- stop the queue
- delete the queue (using taskDriver.delete(queueName) API )
- create a queue with new startTime recurrence parameter

If new startTime is a value in the future (say current-time + 5 minutes), Helix 
is not scheduling the jobs event after startTime timestamp elapses.


 Task scheduler fails to schedule a recurring workflow if the startTime is set 
 to a future timestamp
 ---

 Key: HELIX-600
 URL: https://issues.apache.org/jira/browse/HELIX-600
 Project: Apache Helix
  Issue Type: Bug
Affects Versions: 0.6.3, 0.6.4
Reporter: Karthiek
Assignee: Lei Xia

 If we define a recurrent job queue with a start-time value in the future (say 
 current time + 5 minutes), Helix does not schedule the queue even after the 
 start-time timestamp elapses. Helix should schedule jobs once the recurrence 
 timestamp is hit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HELIX-622) Add new resource configuration option to allow resource to disable emitting monitoring bean.

2016-01-08 Thread Lei Xia (JIRA)
Lei Xia created HELIX-622:
-

 Summary: Add new resource configuration option to allow resource 
to disable emitting monitoring bean.
 Key: HELIX-622
 URL: https://issues.apache.org/jira/browse/HELIX-622
 Project: Apache Helix
  Issue Type: Bug
Reporter: Lei Xia
Assignee: Lei Xia


Helix creates a set of metrics for each resource. Since a job is treated as a 
regular resource by Helix, each job will emit a set of new metrics to ingraph.  
But these metrics are dynamically created, most of them are empty, it is 
meaningless to put any alerts on them, and they are barely used in practice. 

On the other hand, we still need some stable metrics (a fixed set of metric 
names) for the operations team to monitor queue and job running status.

As a short-term solution, we can add an option in JobConfig to enable emitting 
metrics for a job; by default, this is disabled.  As a next step, we will need 
to add a new set of metrics for jobs and workflows.
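
A sketch of what the short-term option could look like; setMonitoringEnabled 
is a hypothetical builder flag, not an existing Helix API:

{code}
// Hypothetical option; per this proposal, metric emission is off by default.
JobConfig jobConfig = new JobConfig.Builder()
    .setCommand("BackupTask")    // illustrative task command
    .setMonitoringEnabled(true)  // hypothetical flag to opt this job into emitting metrics
    .build();
{code}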



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HELIX-623) Do not expose internal configuration field name. Client should use JobConfig.Builder to create jobConfig.

2016-01-08 Thread Lei Xia (JIRA)
Lei Xia created HELIX-623:
-

 Summary: Do not expose internal configuration field names. Clients 
should use JobConfig.Builder to create a JobConfig.
 Key: HELIX-623
 URL: https://issues.apache.org/jira/browse/HELIX-623
 Project: Apache Helix
  Issue Type: Bug
Reporter: Lei Xia
Assignee: Lei Xia


Do not expose internal configuration field names. Clients should use 
JobConfig.Builder to create a JobConfig.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HELIX-618) Job hung if the target resource does not exist anymore at the time when it is scheduled.

2015-12-22 Thread Lei Xia (JIRA)
Lei Xia created HELIX-618:
-

 Summary:  Job hung if the target resource does not exist anymore 
at the time when it is scheduled.
 Key: HELIX-618
 URL: https://issues.apache.org/jira/browse/HELIX-618
 Project: Apache Helix
  Issue Type: Bug
Reporter: Lei Xia
Assignee: Lei Xia


 When a job gets scheduled, if the target resource does not exist any more 
(for example, the database was already deleted but the backup job is still 
there), the job gets stuck, and all of the remaining jobs in the same workflow 
are stuck as well.

Solution:

If the target resource of a job does not exist, the job should be failed 
immediately. Depending on the queue configuration, the remaining jobs will 
either continue to run, or the queue will be marked as failed.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HELIX-615) Naming problem of scheduled jobs from recurrent queue.

2015-11-20 Thread Lei Xia (JIRA)
Lei Xia created HELIX-615:
-

 Summary: Naming problem of scheduled jobs from recurrent queue.
 Key: HELIX-615
 URL: https://issues.apache.org/jira/browse/HELIX-615
 Project: Apache Helix
  Issue Type: Bug
Reporter: Lei Xia
Assignee: Lei Xia


Problem: 
Jobs from a recurrent queue are named with the original job name + an index 
(for example, for a job named backup_job_testdb, the first scheduled job is 
called backup_job_testdb_0, the second time it gets scheduled it will be named 
backup_job_testdb_1, etc.).  This will create name conflicts if the workflow is 
deleted and recreated.

Proposed Change:
Jobs from a recurrent queue are named with the original job name + 
schedule_time (for example, for a job named backup_job_testdb, the scheduled 
job will be named backup_job_testdb_20151028T230011Z; the date follows the ISO 
8601 format).  This will avoid name conflicts even if the workflow is deleted 
and recreated. 
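
A sketch of the proposed naming; the helper itself is illustrative, with the 
formatter pattern matching the example above:

{code}
// Scheduled job name = <original job name> + "_" + <schedule time, ISO 8601 basic format, UTC>.
private static String getScheduledJobName(String jobName, Date scheduleTime) {
  SimpleDateFormat format = new SimpleDateFormat("yyyyMMdd'T'HHmmss'Z'");
  format.setTimeZone(TimeZone.getTimeZone("UTC"));
  return jobName + "_" + format.format(scheduleTime); // e.g. backup_job_testdb_20151028T230011Z
}
{code}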
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HELIX-614) Fix the bug when job expiry time is shorter than job schedule interval in recurring job queue.

2015-11-20 Thread Lei Xia (JIRA)
Lei Xia created HELIX-614:
-

 Summary: Fix the bug when job expiry time is shorter than job 
schedule interval in recurring job queue.
 Key: HELIX-614
 URL: https://issues.apache.org/jira/browse/HELIX-614
 Project: Apache Helix
  Issue Type: Bug
Reporter: Lei Xia
Assignee: Lei Xia


When the job expiry time is shorter than the job schedule interval in a 
recurring job queue, the lastScheduled workflow context will be null. This 
causes an NPE and blocks the following workflow schedules.
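
The fix amounts to a null guard around the last-scheduled context. A sketch 
with assumed names (manager, lastScheduledWorkflow), not the actual scheduler 
code:

{code}
WorkflowContext lastContext = TaskUtil.getWorkflowContext(manager, lastScheduledWorkflow);
// If the last run already expired and was cleaned up, its context is gone;
// treat that like "nothing scheduled yet" instead of dereferencing null.
long lastStartTime = (lastContext == null) ? -1L : lastContext.getStartTime();
{code}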



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HELIX-616) Change JobQueue to be subclass of Workflow instead of WorkflowConfig.

2015-11-20 Thread Lei Xia (JIRA)
Lei Xia created HELIX-616:
-

 Summary: Change JobQueue to be subclass of Workflow instead of 
WorkflowConfig.
 Key: HELIX-616
 URL: https://issues.apache.org/jira/browse/HELIX-616
 Project: Apache Helix
  Issue Type: Bug
Reporter: Lei Xia
Assignee: Lei Xia


Currently, JobQueue is a subclass of WorkflowConfig instead of Workflow.  It is 
not possible for a client to create an initial queue with jobs in it.  The 
client has to call create() to create an empty queue, then call enqueue() to 
add jobs to the queue. The problem is that once the create() call returns, the 
queue has already started to run; if the queue is a recurrent queue, the 
initial scheduled run of the queue contains an empty job set.
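
For illustration, the current two-step pattern that opens the race; the driver 
variable, queue name, and job builder are assumed:

{code}
// With JobQueue as a WorkflowConfig subclass, the queue must be created empty first...
driver.createQueue(new JobQueue.Builder("backupQueue").build());
// ...and a recurrent queue may already have fired its first (empty) run
// before the first job is enqueued:
driver.enqueueJob("backupQueue", "backup_job_testdb", jobCfgBuilder);
{code}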






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HELIX-559) Helix web admin performance issues

2016-03-31 Thread Lei Xia (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220741#comment-15220741
 ] 

Lei Xia commented on HELIX-559:
---

We are still seeing this issue in our production: it takes 40s to 1 min to 
read a resource's ExternalView from Helix web admin, but only 2s to read it 
directly from ZK.  I am reopening the ticket, and we need more investigation 
on this.

> Helix web admin performance issues
> --
>
> Key: HELIX-559
> URL: https://issues.apache.org/jira/browse/HELIX-559
> Project: Apache Helix
>  Issue Type: Bug
>Affects Versions: 0.6.4
>Reporter: Zhen Zhang
>Assignee: Zhen Zhang
> Fix For: 0.6.5
>
>
> Helix web admin has a couple of performance issues:
> - It uses the restlet default server, which is slow; we need to switch to 
> jetty.
> - Unnecessary JSON deserialization/serialization: for reading idealStates 
> from helix web admin, we read the data as a ZNRecord, serialize the ZNRecord 
> to byte arrays, and return the result. It's not necessary to do the der/ser, 
> which costs lots of CPU cycles. Instead, we can read the raw data as byte 
> arrays and return it directly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HELIX-631) AutoRebalanceStrategy does not work correctly all the time

2016-08-09 Thread Lei Xia (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15414352#comment-15414352
 ] 

Lei Xia commented on HELIX-631:
---

Confirmed that this can be reproduced.  The following preference list was 
generated (with partition 9 missing one replica).

{0=[node-1, node-2, node-3], 
1=[node-4, node-2, node-3], 
10=[node-1, node-4, node-2], 
11=[node-1, node-3, node-2], 
12=[node-4, node-3, node-2], 
13=[node-1, node-4, node-3], 
14=[node-1, node-4, node-2], 
15=[node-1, node-3, node-2], 
2=[node-4, node-1, node-3], 
3=[node-4, node-1, node-2],
 4=[node-3, node-1, node-2], 
5=[node-4, node-3, node-2], 
6=[node-4, node-1, node-3],
7=[node-4, node-1, node-2], 
8=[node-3, node-1, node-2], 
9=[node-4, node-3]}

> AutoRebalanceStrategy does not work correctly all the time
> --
>
> Key: HELIX-631
> URL: https://issues.apache.org/jira/browse/HELIX-631
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Subbu
>
> I have 16 partitions, 3 replicas each, and 4 instances to distribute these 
> on. The auto-rebalancer assigns only 2 replicas for one of the partitions.
> Here is the code snippet to reproduce the problem
> {code}
> final String resourceName = "something";
> final List<String> instanceNames = null; // Initialize to 4 unique strings
> final int nPartitions = 16; // 16 partitions, per the description above
> final int nReplicas = 3;
> List<String> partitions = new ArrayList<>(nPartitions);
> for (int i = 0; i < nPartitions; i++) {
>   partitions.add(Integer.toString(i));
> }
> LinkedHashMap<String, Integer> states = new LinkedHashMap<>(2);
> states.put("OFFLINE", 0);
> states.put("ONLINE", nReplicas);
> AutoRebalanceStrategy strategy = new AutoRebalanceStrategy(resourceName, 
> partitions, states);
> ZNRecord znRecord = strategy.computePartitionAssignment(instanceNames, 
> new HashMap<String, Map<String, String>>(0), instanceNames);
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HELIX-527) Mitigate zookeeper watch leak

2016-08-31 Thread Lei Xia (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15452861#comment-15452861
 ] 

Lei Xia commented on HELIX-527:
---

We may need to reevaluate this issue and figure out a solution.  I have 
assigned it to myself and will spend some time on it.
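
For context, pre-3.5.0 the only removal mechanism is the one the description 
below relies on: triggering the watch. A sketch with assumed names (zkClient, 
watchedPath), only workable while the path still exists:

{code}
// Rewrite the node so the pending data watch fires once and is discarded.
Object data = zkClient.readData(watchedPath);
zkClient.writeData(watchedPath, data);
// If the watched path has been deleted, the watch cannot be removed without re-creating it.
{code}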

> Mitigate zookeeper watch leak
> -
>
> Key: HELIX-527
> URL: https://issues.apache.org/jira/browse/HELIX-527
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Zhen Zhang
>
> On investigating zookeeper watch leakage problem, it turns out to be a 
> zookeeper issue:
> https://issues.apache.org/jira/browse/ZOOKEEPER-442
> For zookeeper before 3.5.0, we can't remove watches that are no longer of 
> interest. The only way to remove a watch is to trigger it; that is, if it is 
> a DataWatch, we need to trigger a data change on the watching path, or if it 
> is a ChildWatch, we need to trigger a child change on the watching path. 
> Unfortunately, if we are watching a path that has been deleted, unless we 
> re-create the path, there is no way we can remove the watch.
> Here are some of the most common scenarios where we will have dead zookeeper 
> watches on zookeeper server side even though we unregister all the listeners 
> on the zookeeper client side:
> - When we drop a resource group from a cluster, we may have dead watches on 
> ideal-state, participant current-state, and external-view
> - When we remove an instance from a cluster, we may have dead watches on 
> current-state, participant-config, and participant messages
> - When we use property store with caches enabled by zookeeper watches, we may 
> have dead watches on all removed paths



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HELIX-634) Refactor AutoRebalancer to allow configurable placement strategy.

2016-09-12 Thread Lei Xia (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15485814#comment-15485814
 ] 

Lei Xia commented on HELIX-634:
---

Code already checked in to 0.6.x branch. 
https://github.com/apache/helix/commit/ea0fbbbce302974b88a2b8253bf06616fd91aa5b

> Refactor AutoRebalancer to allow configurable placement strategy.
> 
>
> Key: HELIX-634
> URL: https://issues.apache.org/jira/browse/HELIX-634
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Lei Xia
>Assignee: Lei Xia
>
> With the new auto-rebalancer design, we will need to separate the rebalancer 
> from the actual placement computation strategy. This task is to clean up and 
> refactor the current rebalancer code structure to allow plugging in different 
> placement computation strategies.
> AC:
> - Rebalancer with pluggable placement strategy
> - Placement Strategy is configurable in Ideal-state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HELIX-633) AutoRebalancer should ignore disabled instance and all partitions on disabled instances should be dropped in FULL_AUTO rebalance mode.

2016-09-12 Thread Lei Xia (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15485816#comment-15485816
 ] 

Lei Xia commented on HELIX-633:
---

Code checked in to 0.6.x branch. 
https://github.com/apache/helix/commit/bc0aa76a9de6243928e53e1a1d01e7502ff8267c

> AutoRebalancer should ignore disabled instance and all partitions on disabled 
> instances should be dropped in FULL_AUTO rebalance mode.
> --
>
> Key: HELIX-633
> URL: https://issues.apache.org/jira/browse/HELIX-633
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Lei Xia
>Assignee: Lei Xia
>
> Currently, if a node is disabled in the cluster but still appears in 
> LIVEINSTANCES, the Helix AutoRebalancer still assigns partitions to the 
> disabled node, but does not send state transition messages to the node. So 
> all partitions that belonged to that node disappear from the ExternalView.
> The right behavior is that the AutoRebalancer should ignore the disabled 
> node, or the node should disappear from live instances.
> After this fix, in Helix Full-Auto rebalance mode, if an instance is disabled 
> in its config, no matter whether it is live or not, the partitions on the 
> disabled node will be transferred to the DROPPED state and redistributed to 
> all other live and enabled instances.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HELIX-634) Refactor AutoRebalancer to allow configurable placement strategy.

2016-08-28 Thread Lei Xia (JIRA)
Lei Xia created HELIX-634:
-

 Summary: Refactor AutoRebalancer to allow configurable placement 
strategy.
 Key: HELIX-634
 URL: https://issues.apache.org/jira/browse/HELIX-634
 Project: Apache Helix
  Issue Type: Bug
Reporter: Lei Xia


With the new auto-rebalancer design, we will need to separate the rebalancer 
from the actual placement computation strategy. This task is to clean up and 
refactor the current rebalancer code structure to allow plugging in different 
placement computation strategies.

AC:

- Rebalancer with pluggable placement strategy
- Placement Strategy is configurable in Ideal-state.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HELIX-633) AutoRebalancer should ignore disabled instance and all partitions on disabled instances should be dropped in FULL_AUTO rebalance mode.

2016-08-28 Thread Lei Xia (JIRA)
Lei Xia created HELIX-633:
-

 Summary: AutoRebalancer should ignore disabled instance and all 
partitions on disabled instances should be dropped in FULL_AUTO rebalance mode.
 Key: HELIX-633
 URL: https://issues.apache.org/jira/browse/HELIX-633
 Project: Apache Helix
  Issue Type: Bug
Reporter: Lei Xia
Assignee: Lei Xia


Currently, if a node is disabled in the cluster but still appears in 
LIVEINSTANCES, the Helix AutoRebalancer still assigns partitions to the 
disabled node, but does not send state transition messages to the node. So all 
partitions that belonged to that node disappear from the ExternalView.

The right behavior is that the AutoRebalancer should ignore the disabled node, 
or the node should disappear from live instances.

After this fix, in Helix Full-Auto rebalance mode, if an instance is disabled 
in its config, no matter whether it is live or not, the partitions on the 
disabled node will be transferred to the DROPPED state and redistributed to 
all other live and enabled instances.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HELIX-543) Single partition unnecessarily moved

2016-10-25 Thread Lei Xia (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605838#comment-15605838
 ] 

Lei Xia commented on HELIX-543:
---

This patch has been ported to the 0.6.x branch and will be included in our 
next release.
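
For reference, the proposed fix from the description below, as a sketch with 
assumed names (nodes, currentCount):

{code}
// Give the remainder capacity to the nodes that already host the most partitions,
// so existing assignments are preserved instead of shuffled.
nodes.sort(Comparator.comparingInt((String n) -> currentCount.getOrDefault(n, 0)).reversed());
{code}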


> Single partition unnecessarily moved
> 
>
> Key: HELIX-543
> URL: https://issues.apache.org/jira/browse/HELIX-543
> Project: Apache Helix
>  Issue Type: Bug
>  Components: helix-core
>Affects Versions: 0.7.1, 0.6.4
>Reporter: Tom Widmer
>Assignee: kishore gopalakrishna
>Priority: Minor
>
> (Copied from mailing list)
> I have some resources that I use with the OnlineOffine state but which only 
> have a single partition at the moment (essentially, Helix is just giving me a 
> simple leader election to decide who controls the resource - I don’t care 
> which participant has it, as long as only one does). However, with full auto 
> rebalance, I find that the ‘first’ instance (alphabetically I think) always 
> gets the resource when it’s up. So if I take down the first node so the 
> partition transfers to the 2nd node, then bring back up the 1st node, the 
> resource transfers back unnecessarily.
> Note that this issue also affects multi-partition resources, it's just a bit 
> less noticeable. (It means that with 3 nodes and 4 partitions, say, the 
> partitions are always allocated 2, 1, 1; so if you have node 1 down and hence 
> 0, 2, 2, and then bring up node 1, it unnecessarily moves 2 partitions to 
> make 2, 1, 1 rather than the minimum move to achieve ‘balance', which would 
> be to move 1 partition from instance 2 or 3 back to instance 1.)
> I can see the code in question in 
> AutoRebalanceStrategy.typedComputePartitionAssignment, where the 
> distRemainder is allocated to the first nodes alphabetically, so that the 
> capacity of all nodes is not equal.
> The proposed solution is to sort the nodes by the number of partitions they 
> already have assigned, which should mean that those nodes are assigned the 
> higher capacity and the problem goes away.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HELIX-598) ZkClientPool NPE after closing a connection

2016-11-17 Thread Lei Xia (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674337#comment-15674337
 ] 

Lei Xia commented on HELIX-598:
---

I will take a look at this issue.
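
For reference, a defensive version of the validity check described below; a 
sketch only, as the pool's actual internals may differ:

{code}
// getConnection() returns null once the client has been closed, so guard before dereferencing.
IZkConnection connection = zkClient.getConnection();
boolean usable = connection != null
    && connection.getZookeeperState() == ZooKeeper.States.CONNECTED;
{code}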

> ZkClientPool NPE after closing a connection
> ---
>
> Key: HELIX-598
> URL: https://issues.apache.org/jira/browse/HELIX-598
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Jemiah Westerman
>Assignee: Lei Xia
>
> If a client gets a connection from ZkClientPool and then closes the 
> connection by calling zkClient.close(), any future calls to getZkClient will 
> NPE. The pool attempts to check if the connection is valid by calling 
> "zkClient.getConnection().getZookeeperState() == States.CONNECTED", but for a 
> closed connection the getConnection() call returns null.
> Further, there is currently no way to ask the pool itself to close 
> connections. There is a reset method, but reset simply discards the reference 
> to the connection without closing it. ZkConnectionPool.reset() should close 
> connections before dereferencing them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HELIX-56) Delayed state transition

2016-11-17 Thread Lei Xia (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-56?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674315#comment-15674315
 ] 

Lei Xia commented on HELIX-56:
--

We have a new design for this; I will send out my proposed design and changes 
for review soon.

> Delayed state transition
> 
>
> Key: HELIX-56
> URL: https://issues.apache.org/jira/browse/HELIX-56
> Project: Apache Helix
>  Issue Type: Task
>Affects Versions: 0.6.0-incubating
>Reporter: kishore gopalakrishna
>
> The requirement from Puneet
> I wanted to know how to implement a specific state machine requirement in 
> Helix.
> Lets say a partition is in the state S2.
> 1. On an instance hosting it going down, the partition moves to state
> S3 (but stays on the same instance).
> 2. If the instance comes back up before a timeout expires, the
> partition moves to state S1 (stays on the same instance).
> 3. If the instance does not come back up before the timeout expiry,
> the partition moves to state S0 (the initial state, on a different
> instance picked up by the controller).
> I have a few questions.
> 1. I believe in order to implement Requirement 1, I have to use the
> CUSTOM rebalancing feature (as otherwise the partitions will get
> assigned to a new node).
> The wiki page says the following about the CUSTOM mode.
> "Applications will have to implement an interface that Helix will
> invoke when the cluster state changes. Within this callback, the
> application can recompute the partition assignment mapping"
> Which interface does one have to implement ?  I am assuming the
> callbacks are triggered inside the controller.
>  2. The transition from S2 -> S3 should not issue a callback on the
> participant (instance) holding that partition. This is because the
> participant is unavailable and so cannot execute the callback. Is this
> doable ?
> 3. One way the time-out (Requirement 3) can be implemented is to
> occasionally trigger IdealState calculation after a time-out and not
> only on liveness changes. Does that sound doable ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HELIX-60) Create Helix dashboard

2016-11-17 Thread Lei Xia (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-60?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674320#comment-15674320
 ] 

Lei Xia commented on HELIX-60:
--

We are going to work on a new Helix UI in 2017. Stay tuned!

> Create Helix dashboard
> --
>
> Key: HELIX-60
> URL: https://issues.apache.org/jira/browse/HELIX-60
> Project: Apache Helix
>  Issue Type: Task
>Affects Versions: 0.6.0-incubating
>Reporter: Kanak Biscuitwala
>  Labels: Java, cluster, distributed-systems, gsoc, gsoc2014, 
> javascript, mentor
>
> Right now, ZooInspector is the only graphical way of working with a Helix 
> cluster. It has the following drawbacks:
> - Ugly
> - Very dangerous
> - Unaware of Helix concepts
> It will be great to have a dashboard using play framework to
> - setup cluster
> - view cluster state
> - manage cluster
> With a dashboard, we can more logically show things in terms of Helix. This 
> allows for color coding, potential for drag-and-drop assignment, and saner 
> admin.
> Programming language: Java, JavaScript, potentially others
> Experience required: Basic object-oriented programming, basic web development
> Helpful experience: ZooKeeper, Helix API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HELIX-658) Upgrade Zookeeper version

2017-06-02 Thread Lei Xia (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034714#comment-16034714
 ] 

Lei Xia commented on HELIX-658:
---

I will work on this over the weekend.

> Upgrade Zookeeper version
> -
>
> Key: HELIX-658
> URL: https://issues.apache.org/jira/browse/HELIX-658
> Project: Apache Helix
>  Issue Type: Improvement
>Affects Versions: 0.7.2
>Reporter: Beau Brower
>Assignee: Lei Xia
> Fix For: 0.7.2
>
>
> 0.6.7 has upgraded to ZooKeeper 3.4.9; the
> 0.7.x branch should do the same.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HELIX-823) run-helix-controller.sh command gives error

2020-07-31 Thread Lei Xia (Jira)


[ 
https://issues.apache.org/jira/browse/HELIX-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168999#comment-17168999
 ] 

Lei Xia commented on HELIX-823:
---

[~dasahcc] Has this been fixed in the latest 1.0.1 release?

> run-helix-controller.sh command gives error
> ---
>
> Key: HELIX-823
> URL: https://issues.apache.org/jira/browse/HELIX-823
> Project: Apache Helix
>  Issue Type: Bug
>  Components: helix-core
> Environment: Linux Red Hat 4.8.5-4
>Reporter: anil
>Priority: Blocker
>
> Downloaded the Helix version 1.0.0 binary and set up a two-node cluster. 
> Everything works fine,
> but when firing the command below 
> ./run-helix-controller.sh --zkSvr localhost:2181 --cluster jbpm-cluster
> it gives this error: 
> clusterName:jbpm-cluster, controllerName:null, mode:STANDALONE
> Exception in thread "main" *java.lang.NoSuchFieldError: Rebalancer*
>  at org.apache.helix.InstanceType.(InstanceType.java:39)
>  at 
> org.apache.helix.controller.HelixControllerMain.startHelixController(HelixControllerMain.java:156)
>  at 
> org.apache.helix.controller.HelixControllerMain.main(HelixControllerMain.java:212)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HELIX-694) Cannot Add a Cluster

2020-07-31 Thread Lei Xia (Jira)


[ 
https://issues.apache.org/jira/browse/HELIX-694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169000#comment-17169000
 ] 

Lei Xia commented on HELIX-694:
---

[~dasahcc] Has this been fixed in the latest 1.0.1 release?

> Cannot Add a Cluster
> 
>
> Key: HELIX-694
> URL: https://issues.apache.org/jira/browse/HELIX-694
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Lahiru Jayasekara
>Priority: Blocker
>
> When adding a cluster using the command,
> {{./helix-admin.sh --zkSvr localhost:2181 --addCluster MyCluster}}
> There will be an exception generated as follows, 
> Exception in thread "main" org.apache.helix.HelixException: cluster MyCluster 
> is not setup yet
>     at 
> org.apache.helix.manager.zk.ZKHelixAdmin.addStateModelDef(ZKHelixAdmin.java:708)
>     at 
> org.apache.helix.tools.ClusterSetup.addStateModelDef(ClusterSetup.java:343)
>     at org.apache.helix.tools.ClusterSetup.addCluster(ClusterSetup.java:152)
>     at 
> org.apache.helix.tools.ClusterSetup.processCommandLineArgs(ClusterSetup.java:1009)
>     at org.apache.helix.tools.ClusterSetup.main(ClusterSetup.java:1454)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HELIX-824) best work shoes for overweight

2020-07-31 Thread Lei Xia (Jira)


[ 
https://issues.apache.org/jira/browse/HELIX-824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168998#comment-17168998
 ] 

Lei Xia commented on HELIX-824:
---

Spam, close it.

> best work shoes for overweight
> --
>
> Key: HELIX-824
> URL: https://issues.apache.org/jira/browse/HELIX-824
> Project: Apache Helix
>  Issue Type: Bug
>Reporter: Irinano
>Priority: Major
>
> When choosing shoes, you need to consider all the features of your body and 
> legs. After all, unsuitable shoes will not only wear out quickly but also can 
> cause you inconvenience. Therefore, for people who are overweight, the [best 
> work shoes for 
> overweight|https://gym-expert.com/best-walking-shoes-for-overweight-walkers/] 
> are best suited.
>   
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)