[GitHub] helix pull request #296: Skip resources with state model def ref as Task dur...

2018-11-14 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/296

Skip resources with state model def ref as Task during top state handoff

We should not report top state handoff for resources whose state model def 
ref is "Task": the metric is meaningless for them and creates too many MBeans 
in task-intensive environments.
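
A minimal sketch of the filtering idea, not the code from this pull request. 
The class and method names are hypothetical, and the input is assumed to be a 
plain map from resource name to state model def ref ("MasterSlave" below is 
only an illustrative value).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Helix.
final class TopStateHandoffFilterSketch {
  // State model def ref used by task framework resources, as named in the PR text.
  static final String TASK_STATE_MODEL = "Task";

  // Returns the resources for which top state handoff would still be reported.
  // resourceToStateModelDef maps resource name -> state model def ref.
  static List<String> resourcesToReport(Map<String, String> resourceToStateModelDef) {
    List<String> reported = new ArrayList<>();
    for (Map.Entry<String, String> entry : resourceToStateModelDef.entrySet()) {
      if (TASK_STATE_MODEL.equals(entry.getValue())) {
        continue; // skip task resources: no handoff metric, no extra MBean
      }
      reported.add(entry.getKey());
    }
    return reported;
  }
}
```

With a map containing a "MasterSlave" resource and a "Task" resource, only the 
former is returned.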


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/bug-fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/296.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #296


commit 54158099abe38e18229ae74e4707eb4c822405ec
Author: Harry Zhang 
Date:   2018-11-14T22:42:04Z

Skip resources with state model def ref as Task




---


[GitHub] helix pull request #294: Implement view cluster aggregator

2018-11-02 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/294

Implement view cluster aggregator

Based on #266, the design of the Helix view aggregator, here is the 
implementation. The Helix view aggregator will be a separate module under the 
helix repo, and the implementation will use components from helix-core.
Here is a summary of this PR:

- View cluster related information will be added to `ClusterConfig`, and 
related Java APIs will be added to helix-core.
- `SourceClusterDataProvider` is a component that watches changes from a 
source cluster, updates its cached data, and notifies data change events via 
a given channel
- `ViewClusterRefresher` is a component that does the actual refresh 
operation of the view cluster. It computes the diff between source clusters 
and the view cluster, and makes changes to the view cluster accordingly
- `SourceClusterConfigChangeAction` is a wrapper containing information 
about what to update for a source cluster. It takes in the old and new 
`ClusterConfig` and computes actions to adapt the view cluster to the changes
- `HelixViewAggregator` hooks up the small components and contains the main 
reconciliation loop for refreshing the view cluster
- A metrics recording mechanism is added
- Starting the Helix view aggregator in stand-alone mode and in distributed 
mode (by adding a state model to a Helix participant) is supported
- Related unit tests and integration tests are added
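
The diff step described for `ViewClusterRefresher` can be pictured roughly as 
follows. This is only a sketch under simplifying assumptions: it compares 
names of properties (for example external view names) and ignores content 
changes, and the class `ViewClusterDiffSketch` and its fields are 
hypothetical, not taken from the PR.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of a name-level diff between source and view data.
final class ViewClusterDiffSketch {
  final Set<String> toCreate; // present in source clusters, missing in view cluster
  final Set<String> toDelete; // present in view cluster, gone from source clusters

  private ViewClusterDiffSketch(Set<String> toCreate, Set<String> toDelete) {
    this.toCreate = toCreate;
    this.toDelete = toDelete;
  }

  // sourceNames: union of property names across all source clusters.
  // viewNames: property names currently stored in the view cluster.
  static ViewClusterDiffSketch compute(Set<String> sourceNames, Set<String> viewNames) {
    Set<String> create = new TreeSet<>(sourceNames);
    create.removeAll(viewNames);
    Set<String> delete = new TreeSet<>(viewNames);
    delete.removeAll(sourceNames);
    return new ViewClusterDiffSketch(create, delete);
  }
}
```

A reconciliation loop would then apply `toCreate` and `toDelete` to the view 
cluster on each refresh; updates to properties that exist on both sides are 
left out of this sketch.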


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/view-aggregator-impl

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/294.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #294


commit d6b98f236d85522866e4e7584561a02ce525d142
Author: Harry Zhang 
Date:   2018-11-01T22:26:45Z

[HELIX-781]: add java apis for view cluster config support

commit 759cf8a9a48805851c50681afc98b2cafe309ff7
Author: Harry Zhang 
Date:   2018-11-02T21:28:23Z

[HELIX-781] set up module structure for helix view aggregator

commit 2ae85f6127b43aa7928776ede7e9bbbfe94e9385
Author: Harry Zhang 
Date:   2018-11-02T21:38:14Z

[HELIX-781] implement SourceClusterDataProvider and add tests

commit 1df6c7e15bc31fb424883e3febcd9f8cd019de82
Author: Harry Zhang 
Date:   2018-11-02T21:42:56Z

[HELIX-781] implement ViewClusterRefresher and add tests

commit 7802cf2443d45092c69cb529b4bba17a3c45dc31
Author: Harry Zhang 
Date:   2018-11-02T21:50:35Z

[HELIX-781] implement SourceClusterConfigChangeAction and added tests

commit 985a4ac2bcc43738d9c7ab463ad606f8691f7298
Author: Harry Zhang 
Date:   2018-11-02T21:59:21Z

[HELIX-781] implement helix view aggregator main logic and added tests

commit 3e2cfb695720fc09f181b55a3a31adc2b44f94a4
Author: Harry Zhang 
Date:   2018-11-02T22:08:43Z

[HELIX-781] added metrics to helix view aggregator and added tests

commit 82d33e0aad1ad2279f864fea6d5851129ba56610
Author: Harry Zhang 
Date:   2018-11-02T22:11:00Z

[HELIX-781] added main function to start helix view aggregator via bash

commit 11a7c126ca8bb9298754cb9127af588148af0902
Author: Harry Zhang 
Date:   2018-11-02T22:12:57Z

[HELIX-781] support deploy helix view aggregator in a distributed fashion 
using helix participant




---


[GitHub] helix pull request #292: [HELIX-785] Record helix latency instead of user la...

2018-11-02 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/292

[HELIX-785] Record helix latency instead of user latency in top state 
handoff metrics

- top state handoff reports helix latency instead of user latency
- modified test cases

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/top-state-handoff-metrics

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/292.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #292


commit 37a58cfff91fb5f6608a4a06d1922bb5a5eb9ca1
Author: Harry Zhang 
Date:   2018-11-02T18:30:15Z

[HELIX-785] Record helix latency instead of user latency in top state 
handoff metrics




---


[GitHub] helix pull request #291: Fix unstable TestControllerLeadershipChange

2018-11-01 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/291

Fix unstable TestControllerLeadershipChange

- make setLeader more reliable
- restart participant after manager 1 regains leadership
- use cluster verifier to wait for the cluster to converge

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/test-fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/291.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #291


commit 43bbc6454ab785a998936b5638c012c1a0076969
Author: Harry Zhang 
Date:   2018-11-02T00:50:09Z

Fix unstable TestControllerLeadershipChange




---


[GitHub] helix pull request #290: fix potential NPE in TopStateHandoffReportStage

2018-11-01 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/290

fix potential NPE in TopStateHandoffReportStage



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/minor-fixes

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/290.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #290


commit aa1c9ff50975b53d80ba67cd8bdeb51fe782d73d
Author: Harry Zhang 
Date:   2018-11-02T00:48:05Z

fix potential NPE in TopStateHandoffReportStage




---


[GitHub] helix pull request #289: [HELIX-780] add task user content related api and a...

2018-11-01 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/289

[HELIX-780] add task user content related api and added more tests

- added get/add task user content rest api
- consolidated rest api behavior: when getting/adding user content, if 
job/workflow does not exist, throw 404
- added more test cases

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/tf-rest-api

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/289.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #289


commit 18aa67b6d5c703e5b938b2f915f52a6ca856e889
Author: Harry Zhang 
Date:   2018-10-09T21:31:00Z

[HELIX-780] add task user content related api and added more tests




---


[GitHub] helix pull request #287: [HELIX-780] add get/add job user content rest api

2018-11-01 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/287

[HELIX-780] add get/add job user content rest api

added apis and tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/tf-rest-api

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/287.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #287


commit a09a18ac55464c3e399800b4474ccb6e64d168ec
Author: Harry Zhang 
Date:   2018-10-08T22:36:53Z

[HELIX-780] add get/add job user content rest api




---


[GitHub] helix pull request #285: [HELIX-779] do not clean list field in maintenance ...

2018-11-01 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/285

[HELIX-779] do not clean list field in maintenance rebalancer for new 
resources

Setting list fields to an empty map prevents resources that were newly added 
and initially rebalanced during maintenance mode from being rebalanced again 
after the cluster exits maintenance mode.
The right thing to do is to clear every preference list.
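
To make the distinction concrete, here is a small sketch that contrasts the 
two shapes. It assumes list fields are modeled as a map from partition name to 
preference list; the class and method names are hypothetical and this is not 
the rebalancer code itself.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the maintenance rebalancer implementation.
final class PreferenceListResetSketch {
  // Keeps every partition key but replaces its preference list with an empty one.
  // Replacing the whole map with an empty map would instead drop the keys as well.
  static Map<String, List<String>> clearEveryList(Map<String, List<String>> listFields) {
    Map<String, List<String>> cleared = new LinkedHashMap<>();
    for (String partition : listFields.keySet()) {
      cleared.put(partition, new ArrayList<>());
    }
    return cleared;
  }
}
```

After `clearEveryList`, a partition such as `MyDB_0` is still present with an 
empty list, whereas an empty map carries no partitions at all.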


Also added test case to verify

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/maintenance-fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/285.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #285


commit bfaa8399529b6e63b307c1fbe60903c3ca08fbb1
Author: Harry Zhang 
Date:   2018-10-04T22:50:16Z

[HELIX-779] do not clean list field in maintenance rebalancer for new 
resources




---


[GitHub] helix pull request #283: [HELIX-775] consolidate user content related apis f...

2018-10-31 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/283

[HELIX-775] consolidate user content related apis for task driver

HELIX-1315: consolidate user content related apis for task driver


To consolidate the task driver user content related APIs and the 
corresponding REST APIs, I'm deprecating the general getUserContent() API. 
Instead, we now have the following APIs to get / add / update user content.

```java
public void addOrUpdateWorkflowUserContentMap(String workflowName,
    final Map contentToAddOrUpdate);

public void addOrUpdateJobUserContentMap(String workflowName, String jobName,
    final Map contentToAddOrUpdate);

public void addOrUpdateTaskUserContentMap(String workflowName, String jobName,
    String taskPartitionId, final Map contentToAddOrUpdate);

public Map getWorkflowUserContentMap(String workflowName);

public Map getJobUserContentMap(String workflowName, String jobName);

public Map getTaskUserContentMap(String workflowName, String jobName,
    String taskPartitionId);
```

The delete user content API is TBD but can follow the same convention.
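
As an illustration of the add-or-update semantics these method names suggest, 
here is an in-memory stand-in for the workflow-level pair. It is a sketch 
only: the class is hypothetical, the String-to-String map type is an 
assumption (the listing above shows no type parameters), and returning an 
empty map for an unknown workflow is also an assumption rather than the task 
driver's documented behavior.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in; not the Helix TaskDriver.
final class InMemoryUserContentSketch {
  private final Map<String, Map<String, String>> contentByWorkflow = new HashMap<>();

  // Adds new keys and overwrites existing ones; untouched keys are kept.
  void addOrUpdateWorkflowUserContentMap(String workflowName,
      Map<String, String> contentToAddOrUpdate) {
    contentByWorkflow
        .computeIfAbsent(workflowName, name -> new HashMap<>())
        .putAll(contentToAddOrUpdate);
  }

  // Returns a copy so callers cannot mutate the stored content.
  Map<String, String> getWorkflowUserContentMap(String workflowName) {
    Map<String, String> content = contentByWorkflow.get(workflowName);
    return content == null ? new HashMap<>() : new HashMap<>(content);
  }
}
```

Calling the add-or-update method twice with overlapping keys leaves the latest 
value for each overlapping key and keeps the rest. The job-level and 
task-level variants would presumably differ only in how the content is keyed.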

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/task-user-content

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/283.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #283


commit b235c4ee5a82c5970d29e839317ea242813a58bc
Author: Harry Zhang 
Date:   2018-10-04T18:25:08Z

[HELIX-775] consolidate user content related apis for task driver




---


[GitHub] helix pull request #282: [HELIX-775] add task driver support for helix rest ...

2018-10-31 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/282

[HELIX-775] add task driver support for helix rest to add/get task 
framework user content


consolidate user content related apis for task driver


To consolidate the task driver user content related APIs and the 
corresponding REST APIs, I'm deprecating the general getUserContent() API. 
Instead, we now have the following APIs to get / add / update user content.

```java
public void addOrUpdateWorkflowUserContentMap(String workflowName,
    final Map contentToAddOrUpdate);

public void addOrUpdateJobUserContentMap(String workflowName, String jobName,
    final Map contentToAddOrUpdate);

public void addOrUpdateTaskUserContentMap(String workflowName, String jobName,
    String taskPartitionId, final Map contentToAddOrUpdate);

public Map getWorkflowUserContentMap(String workflowName);

public Map getJobUserContentMap(String workflowName, String jobName);

public Map getTaskUserContentMap(String workflowName, String jobName,
    String taskPartitionId);
```

The API for deleting user content is TBD but can follow the same convention.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/task-user-content

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/282.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #282


commit 7ec5313bccb679014d6a0605ee5d7184063e555e
Author: Harry Zhang 
Date:   2018-10-31T20:55:44Z

[HELIX-775] add task driver support for helix rest to add/get task 
framework user content




---


[GitHub] helix pull request #281: [HELIX-773] add getLastScheduledTaskTimestamp infor...

2018-10-30 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/281

[HELIX-773] add getLastScheduledTaskTimestamp information in workflow rest 
API

- Added TaskExecutionInfo object to wrap task execution information
- added TaskExecutionInfo to last scheduled task in workflow property in 
workflow rest API
- Modified related tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/workflow-rest

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/281.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #281


commit 917f6b7ee1b2b44b10eea7e5de7f07aa7f184618
Author: Harry Zhang 
Date:   2018-10-30T23:43:25Z

[HELIX-773] add getLastScheduledTaskTimestamp information in workflow rest 
api




---


[GitHub] helix pull request #280: [HELIX-772] add TaskDriver.addUserContent() api and...

2018-10-30 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/280

[HELIX-772] add TaskDriver.addUserContent() api and related tests


Implemented TaskDriver.addUserContent()
Added test (TestGetSetUserContentStore) for testing all getter/setter for 
user content
Modified unstable TestIndependentTaskRebalancer

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/add-user-content

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/280.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #280


commit df24f5975bd517626490f14e6e038f8370ddd815
Author: Harry Zhang 
Date:   2018-10-30T23:25:12Z

[HELIX-772] add TaskDriver.addUserContent() api and related tests




---


[GitHub] helix pull request #278: [HELIX-771] More detailed top state handoff metrics

2018-10-30 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/278

[HELIX-771] More detailed top state handoff metrics


Added more details about top state handoff to distinguish helix latency and 
user latency


We define 2 types of handoff:
- Graceful handoff (controlled top state handoff, e.g. disable instance, 
load balance, etc.)
- Non-graceful handoff (uncontrolled top state handoff, e.g. node crash, etc.)


For graceful handoff, we record total handoff latency and user latency
For non-graceful handoff, we record total handoff latency only
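
A rough sketch of how the two handoff types could be represented. The class 
and field names are hypothetical, and treating helix latency as total latency 
minus user latency is an assumption made for illustration, not something this 
PR text states.

```java
// Hypothetical value type for illustration; not the Helix metrics code.
final class TopStateHandoffSample {
  enum Type { GRACEFUL, NON_GRACEFUL }

  final Type type;
  final long totalLatencyMs;
  final Long userLatencyMs; // null when user latency is not recorded

  private TopStateHandoffSample(Type type, long totalLatencyMs, Long userLatencyMs) {
    this.type = type;
    this.totalLatencyMs = totalLatencyMs;
    this.userLatencyMs = userLatencyMs;
  }

  // Graceful handoff: both total and user latency are recorded.
  static TopStateHandoffSample graceful(long startMs, long endMs, long userLatencyMs) {
    return new TopStateHandoffSample(Type.GRACEFUL, endMs - startMs, userLatencyMs);
  }

  // Non-graceful handoff: only total latency is recorded.
  static TopStateHandoffSample nonGraceful(long startMs, long endMs) {
    return new TopStateHandoffSample(Type.NON_GRACEFUL, endMs - startMs, null);
  }

  // Assumption: helix latency is what remains after subtracting user latency.
  Long helixLatencyMs() {
    return userLatencyMs == null ? null : totalLatencyMs - userLatencyMs;
  }
}
```

For a graceful handoff from t=1000 to t=1500 with 200 ms of user latency, this 
gives a total of 500 ms and, under the stated assumption, 300 ms of helix 
latency.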


Moved top state handoff metrics to an independent stage to make the logic 
cleaner.
Refactored TestTopStateHandoffMetrics to make it cleaner and to use JSON more 
natively

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/topstate-metrics

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/278.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #278


commit 7e49f995e29ea200fcc42ce6af148ed521979f5c
Author: Harry Zhang 
Date:   2018-10-30T22:55:20Z

[HELIX-771] More detailed top state handoff metrics




---


[GitHub] helix pull request #266: Propose design for aggregated cluster view service

2018-10-23 Thread zhan849
Github user zhan849 commented on a diff in the pull request:

https://github.com/apache/helix/pull/266#discussion_r227597163
  
--- Diff: designs/aggregated-cluster-view/design.md ---
@@ -0,0 +1,353 @@
+Aggregated Cluster View Design
+==
+
+## Introduction
+Currently Helix organize information by cluster - clusters are autonomous 
entities that holds resource / node information.
+In real practice, a helix client might need to access aggregated 
information of helix clusters from different data center regions for management 
or coordination purpose.
+This design proposes a service in Helix ecosystem for clients to retrieve 
cross-datacenter information in a more efficient way. 
+
+
+## Problem Statement
+We identified a couple of use cases for accessing cross datacenter 
information. [Ambry](https://github.com/linkedin/ambry) is one of them.
--- End diff --

Sure (will also update design doc about it).

Ambry uses a Helix spectator in both its router (for retrying get requests 
remotely if they fail locally) and its storage node (for data replication). 
Given the number of clients that need global information, it would be more 
cost-effective for them if aggregated information were provided locally.


---


[GitHub] helix pull request #266: Propose design for aggregated cluster view service

2018-10-23 Thread zhan849
Github user zhan849 commented on a diff in the pull request:

https://github.com/apache/helix/pull/266#discussion_r227562948
  
--- Diff: designs/aggregated-cluster-view/design.md ---
@@ -0,0 +1,353 @@
+Aggregated Cluster View Design
+==
+
+## Introduction
+Currently Helix organize information by cluster - clusters are autonomous 
entities that holds resource / node information.
+In real practice, a helix client might need to access aggregated 
information of helix clusters from different data center regions for management 
or coordination purpose.
+This design proposes a service in Helix ecosystem for clients to retrieve 
cross-datacenter information in a more efficient way. 
+
+
+## Problem Statement
+We identified a couple of use cases for accessing cross datacenter 
information. [Ambry](https://github.com/linkedin/ambry) is one of them.
+Here is a simplified example: some service has Helix cluster "MyDBCluster" 
in 3 data centers respectively, and each cluster has a resource named "MyDB".
+To federate this "MyDBCluster", current usage is to have each federation 
client (usually Helix spectator) to connect to metadata store endpoints in all 
fabrics to retrieve information and aggregate them locally.
+Such usge has the following drawbacks:
+
+* As there are a lot of clients in each DC that need cross-dc information, 
there are a lot of expensive cross-dc traffics
+* Every client needs to know information about metadata stores in all 
fabrics which
+  * Increases operational cost when these information changes
+  * Increases security concern by allowing cross data center traffic
+
+To solve the problem, we have the following requirements:
+* Clients should still be able to GET/WATCH aggregated information from 1 
or more metadata stores (likely but not necessarily from different data centers)
+* Cross DC traffic should be minimized
+* Reduce amount of information about data center that a client needs
+* Agility of information aggregation can be configured
+* Currently, it's good enough to have only LiveInstance, InstanceConfig, 
and ExternalView aggregated
+
+
+
+
+
+## Proposed Design
+
+To provide aggregated cluster view, the solution I'm proposing is to add a 
special type of cluster, i.e. **View Cluster**.
+View cluster leverages current Helix semantics to store aggregated 
information of various **Source Clusters**.
+There will be another micro service (Helix View Aggregator) running, 
fetching information from clusters (likely from other data centers) to be 
aggregated, and store then to the view cluster.
--- End diff --

Though setting up an observer local to clients can potentially reduce cross 
data center traffic, it has a few drawbacks:
1. All data changes will be propagated immediately, and if such information 
is not required frequently, there will be wasted traffic. Building a service 
makes it possible to customize aggregation granularity
2. Using a ZooKeeper observer leaves the aggregation logic to the client; 
providing aggregated data will make it easier for users to consume
3. Building a service will leave room to customize aggregated data in the 
future, e.g. if we want to aggregate ideal state, we might not need to 
aggregate preference lists, etc.

Will add these points into the design doc


---


[GitHub] helix pull request #270: [HELIX-753] Record top state handoff finished in si...

2018-09-21 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/270

[HELIX-753] Record top state handoff finished in single cluster data cache 
refresh

This PR adds top state handoff reporting when a single pipeline refresh 
catches the entire handoff process, which we missed before. Here is the rough 
procedure:


- retrieve cached last top state instance for a partition
- retrieve current top state instance for a partition
- if there is no missing top state record for that partition, and the top 
state instance changed, we record the handoff

The current top state end time is easy to find from the current state in the 
cluster data cache. For the handoff start time, if we cannot find it, we use 
the last pipeline run's end time as a best guess. The detailed reason is 
explained in a code comment.
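
The procedure above can be sketched as a single pure function. This is an 
illustration under assumptions, not the code in the PR: the class, the 
parameter names, and the `NOT_RECORDED` sentinel are hypothetical, and a 
non-positive `handoffStartMs` stands in for "start time could not be found".

```java
// Hypothetical illustration of the single-refresh handoff check.
final class SingleRefreshHandoffSketch {
  static final long NOT_RECORDED = -1L;

  // Returns the handoff duration in ms, or NOT_RECORDED when nothing should be reported.
  static long durationMs(String lastTopStateInstance, String currentTopStateInstance,
      boolean hasMissingTopStateRecord, long handoffStartMs, long lastPipelineEndMs,
      long currentTopStateEndMs) {
    if (hasMissingTopStateRecord) {
      return NOT_RECORDED; // a missing top state record exists, so this check does not apply
    }
    if (lastTopStateInstance == null || currentTopStateInstance == null
        || lastTopStateInstance.equals(currentTopStateInstance)) {
      return NOT_RECORDED; // no change of top state instance observed
    }
    // Best guess for the start time when it is unknown: end of the last pipeline run.
    long startMs = handoffStartMs > 0 ? handoffStartMs : lastPipelineEndMs;
    return currentTopStateEndMs - startMs;
  }
}
```

For example, a change from `n1` to `n2` with an unknown start time, a last 
pipeline end at 1000 and a top state end at 1250 yields 250 ms.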


Added a test case to verify such top state handoff, and consolidated the 
common parts of TestTopStateHandoffMetrics to avoid code duplication

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/topstate

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/270.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #270


commit d501e8fa30596d9cd98078f0d1ce7c1ecf20c595
Author: Harry Zhang 
Date:   2018-09-21T21:32:15Z

[HELIX-753] Record top state handoff finished in single cluster data cache 
refresh




---


[GitHub] helix pull request #266: Propose design for aggregated cluster view service

2018-08-20 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/266

Propose design for aggregated cluster view service

This PR adds a design doc for aggregated cluster view service.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/view-aggregator-design

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/266.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #266


commit d7e6c1e0d51229319094e025ad6b70f5d5deed3e
Author: Harry Zhang 
Date:   2018-08-21T02:11:14Z

Propose design for aggregated cluster view service




---


[GitHub] helix pull request #258: [HELIX-741] make swap instance more robust and idem...

2018-07-17 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/258

[HELIX-741] make swap instance more robust and idempotent

Made swap instance more robust:
1. List ideal state names and read each ideal state individually to avoid 
partial reads
2. Remove redundant logic that tests the old instance's status
3. Make it idempotent
4. Added test cases
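
To illustrate what idempotent means for a swap, here is a sketch that rewrites 
preference lists only. It is not the swap instance implementation from this 
PR; the class and method names are hypothetical, and a real swap would touch 
more than preference lists.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the Helix admin code.
final class SwapInstanceSketch {
  // Returns a copy of the preference lists with oldInstance replaced by newInstance.
  // Running it again on its own output changes nothing, because no oldInstance remains.
  static Map<String, List<String>> swap(Map<String, List<String>> preferenceLists,
      String oldInstance, String newInstance) {
    Map<String, List<String>> result = new LinkedHashMap<>();
    for (Map.Entry<String, List<String>> entry : preferenceLists.entrySet()) {
      List<String> swapped = new ArrayList<>();
      for (String instance : entry.getValue()) {
        swapped.add(instance.equals(oldInstance) ? newInstance : instance);
      }
      result.put(entry.getKey(), swapped);
    }
    return result;
  }
}
```

Because a second call finds nothing left to replace, retrying a swap after a 
partial failure is safe in this model.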

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/helix-admin

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/258.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #258


commit 24c52394dfff91c045367260c969f76560ebeb62
Author: Harry Zhang 
Date:   2018-07-18T01:21:48Z

[HELIX-741] make swap instance more robust and idempotent




---


[GitHub] helix pull request #257: [HELIX-740] check NPE in getInstancesInClusterWithT...

2018-07-17 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/257

[HELIX-740] check NPE in getInstancesInClusterWithTag and throw more 
meaningful exception

Added a cluster config check in `getInstancesInClusterWithTag()`, which now 
throws IllegalStateException when an instance config is missing
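
A minimal sketch of the pattern, replacing an implicit NullPointerException 
with an explicit IllegalStateException. The class, the method signature, and 
the map-based stand-in for instance configs are hypothetical and do not mirror 
the real `getInstancesInClusterWithTag()` signature.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; not the Helix admin implementation.
final class TaggedInstanceLookupSketch {
  // tagsByInstance stands in for instance configs: instance name -> its tags.
  static List<String> instancesWithTag(List<String> instanceNames,
      Map<String, Set<String>> tagsByInstance, String tag) {
    List<String> result = new ArrayList<>();
    for (String name : instanceNames) {
      Set<String> tags = tagsByInstance.get(name);
      if (tags == null) {
        // Fail with a descriptive message instead of a bare NPE further down.
        throw new IllegalStateException("Instance config is missing for instance: " + name);
      }
      if (tags.contains(tag)) {
        result.add(name);
      }
    }
    return result;
  }
}
```

The caller then sees which instance lacks a config rather than a stack trace 
pointing at a null dereference.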

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/helix-admin

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/257.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #257


commit f4bb7d60782150c7d713c907211cc9d41f002c48
Author: Harry Zhang 
Date:   2018-07-17T22:50:02Z

[HELIX-740] check NPE in getInstancesInClusterWithTag and throw more 
meaningful exception




---


[GitHub] helix issue #248: helix manager should support getting metadata store connec...

2018-07-16 Thread zhan849
Github user zhan849 commented on the issue:

https://github.com/apache/helix/pull/248
  
@kishoreg We have a request here: it would be handy to retrieve the zk 
address from ZkHelixManager so that user components can perform customized 
operations in ZooKeeper without sharing a ZkClient with the Helix component. 
As the zk address is part of ZkHelixManager's configuration, adding a getter 
here fits the semantics.

To make it more general (and to introduce less code change), such a method 
should be part of the HelixManager interface, and "MetadataStore" is a more 
generic name to use here.


---


[GitHub] helix pull request #248: helix manager should support getting metadata store...

2018-07-16 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/248

helix manager should support getting metadata store connection string

Add an API to get the metadata store connection string in Helix Manager

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/helix-manager

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/248.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #248


commit 921d1fc9822a2ae3ddd2adc854a12ec486ad6c08
Author: Harry Zhang 
Date:   2018-07-16T22:23:30Z

helix manager should support getting metadata store connection string




---


[GitHub] helix pull request #224: [HELIX-718] implement ThreadCountBasedTaskAssigner

2018-07-09 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/224

[HELIX-718] implement ThreadCountBasedTaskAssigner

In this RB, I implemented a thread count based task assigner that is 
optimized for short-term use cases. It assumes:
- All tasks to assign have the same quota type
- All tasks to assign require only 1 thread


The algorithm makes a best effort to spread out tasks of the same type / same 
job. For example:
- if there are 3 nodes, each has 10 threads for each quota type A, B, and C
- node1 is empty, node2 and node3 each has 5 typeB tasks and 5 typeC tasks 
running
=> when 3 typeA tasks are to be assigned, it will assign 1 typeA task to 
each node rather than squeeze all 3 typeA tasks onto node1.
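
One way to get this spreading behavior is a greedy pass over a max-heap keyed 
on free threads for the quota type. The sketch below is an illustration of 
that idea, not the ThreadCountBasedTaskAssigner from this PR: the class name, 
the tie-break by node name, and the choice to leave tasks unassigned when no 
capacity remains are all assumptions.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical illustration of spreading single-thread tasks of one quota type.
final class SpreadingAssignerSketch {
  private static final class Node {
    final String name;
    int freeThreads;

    Node(String name, int freeThreads) {
      this.name = name;
      this.freeThreads = freeThreads;
    }
  }

  // freeThreadsByNode: free threads per node for the quota type being assigned.
  // Returns task -> node; tasks that do not fit anywhere are left out.
  static Map<String, String> assign(List<String> tasks, Map<String, Integer> freeThreadsByNode) {
    // Most free threads first; ties broken by node name to keep results deterministic.
    Comparator<Node> mostFreeFirst = (a, b) -> a.freeThreads != b.freeThreads
        ? Integer.compare(b.freeThreads, a.freeThreads)
        : a.name.compareTo(b.name);
    PriorityQueue<Node> heap = new PriorityQueue<>(mostFreeFirst);
    for (Map.Entry<String, Integer> entry : freeThreadsByNode.entrySet()) {
      heap.add(new Node(entry.getKey(), entry.getValue()));
    }
    Map<String, String> assignment = new LinkedHashMap<>();
    for (String task : tasks) {
      Node best = heap.poll();
      if (best == null || best.freeThreads <= 0) {
        break; // the node with the most free threads has none left
      }
      assignment.put(task, best.name);
      best.freeThreads--; // each task needs exactly 1 thread
      heap.add(best);
    }
    return assignment;
  }
}
```

In the 3-node example, all nodes have 10 free typeA threads, so three typeA 
tasks land on node1, node2 and node3 respectively instead of piling onto 
node1.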



Added tests for the assigner. Below are the profiling results; each result 
is the average of 100 trials:


Assign 50K tasks onto 1K nodes:

testing batch size: 1
Average time: 118ms
testing batch size: 5000
Average time: 114ms
testing batch size: 2000
Average time: 117ms
testing batch size: 1000
Average time: 119ms
testing batch size: 500
Average time: 123ms
testing batch size: 100
Average time: 182ms



Assign 10K tasks onto 1K nodes:

testing batch size: 1
Average time: 25ms
testing batch size: 5000
Average time: 21ms
testing batch size: 2000
Average time: 22ms
testing batch size: 1000
Average time: 25ms
testing batch size: 500
Average time: 22ms
testing batch size: 100
Average time: 34ms

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/simple-assigner

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/224.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #224


commit 6cb574d5aea6ca9cb9e6b5184bc80cb5e05d53b8
Author: Harry Zhang 
Date:   2018-07-09T23:04:19Z

[HELIX-718] implement ThreadCountBasedTaskAssigner




---


[GitHub] helix pull request #223: [HELIX-718] provide a method in AssignableInstance ...

2018-07-09 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/223

[HELIX-718] provide a method in AssignableInstance to set current assignment

This is required because when an assignable instance is initialized, it 
needs to recover its current state

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/assignable-instance

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/223.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #223


commit e44b29e03ef4c807e940cde717ed2f6fff58a273
Author: Harry Zhang 
Date:   2018-07-09T22:59:27Z

[HELIX-718] provide a method in AssignableInstance to set current 
assignments




---


[GitHub] helix pull request #220: [HELIX-718] implement TaskAssignResult

2018-07-09 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/220

[HELIX-718] implement TaskAssignResult

Implement TaskAssignResult as a part of task assigner

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/task-assign-result

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/220.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #220


commit 701947d5a033792f21dd2796a29577702782fd26
Author: Harry Zhang 
Date:   2018-07-09T21:22:20Z

[HELIX-718] implement TaskAssignResult




---


[GitHub] helix pull request #219: [HELIX-717] Add api for get / set quota type, ratio...

2018-07-09 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/219

[HELIX-717] Add api for get / set quota type, ratio and participant capacity

Add api for get / set quota type, ratio and participant capacity

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/task-quota

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/219.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #219


commit 9ff603e9c39b53d5035cccb31fcf6edf82d97f18
Author: Harry Zhang 
Date:   2018-07-09T21:07:56Z

[HELIX-717] Add api for get / set quota type, ratio and participant capacity




---


[GitHub] helix pull request #218: minor logging improvements

2018-07-09 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/218

minor logging improvements

minor log fixes

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/minor-improvements

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/218.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #218


commit 3870ab0f31a8f8a44ac5816ed0bde38fc7433bd0
Author: Harry Zhang 
Date:   2018-07-09T20:56:57Z

minor logging improvements




---


[GitHub] helix pull request #214: [HELIX-709] Move external view calculation to async...

2018-07-09 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/214

[HELIX-709] Move external view calculation to async stage and re-organize 
pipeline

- Separated controller pipeline to execute external view compute async and 
as early as possible
- renamed AbstractAsyncBaseStage
- fixed NPE in callback handler
- all tests passed

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/async-ev

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/214.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #214


commit 542fbc840a167986a40bd57f3c5660d294acb63c
Author: Harry Zhang 
Date:   2018-07-09T19:16:56Z

[HELIX-709] Move external view calculation to async stage and re-organize 
pipeline




---


[GitHub] helix pull request #209: [HELIX-710] Create abstract state model for distrib...

2018-06-28 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/209

[HELIX-710] Create abstract state model for distributed leader standby 
helix service

This RB abstracts a leader standby state model that helix services, such as 
the controller, would commonly use. This reduces duplicated code and 
simplifies state model implementation.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/abstract-ls-state-model

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/209.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #209


commit 4a99bc43c6f22e478a49fb7f2bbac42d608f17b5
Author: Harry Zhang 
Date:   2018-06-28T21:32:51Z

[HELIX-710] Create abstract state model for distributed leader standby 
helix service




---


[GitHub] helix pull request #208: [HELIX-709] Prepare controller stages for async exe...

2018-06-28 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/208

[HELIX-709] Prepare controller stages for async execution

- Implemented AbstractAsyncBaseStage
- Refactored TEVCalcState and PersistAssignmentStage to use 
AbstractAsyncBaseStage
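
A minimal sketch of what an async base stage could look like, assuming the pipeline hands each event to a worker thread and moves on. `AsyncStageSketch`, `process` and `execute` are hypothetical names for illustration only; the real AbstractAsyncBaseStage in this PR may be structured differently.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

/** Hypothetical base class: subclasses implement execute(), the pipeline calls process(). */
abstract class AsyncStageSketch<E> {
  private final ExecutorService worker;

  AsyncStageSketch(ExecutorService worker) {
    this.worker = worker;
  }

  /** Submits the stage logic to the worker and returns without waiting for it. */
  final Future<?> process(E event) {
    return worker.submit(() -> {
      try {
        execute(event);
      } catch (Exception e) {
        // Assumption: a failed async stage is reported but does not fail the pipeline.
        System.err.println("async stage failed: " + e);
      }
    });
  }

  /** The actual stage work, run on the worker thread. */
  protected abstract void execute(E event) throws Exception;
}
```

The returned Future is only there so a caller (or a test) can wait for completion when it needs to; the pipeline itself would ignore it.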

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/aabs

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/208.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #208


commit 9080c64429d724aa959207411ca06d690f5ee840
Author: Harry Zhang 
Date:   2018-06-28T21:25:21Z

[HELIX-709] Prepare controller stages for async execution




---


[GitHub] helix pull request #206: [HELIX-706] process tev and persist assignment asyn...

2018-06-26 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/206

[HELIX-706] process tev and persist assignment asynchronously

Added async worker in generic helix controller to process persist 
assignment stage and tev generation stage asynchronously

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/async-ev

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/206.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #206


commit bb4ffd7e5663377427a5ad5988948659dd0db378
Author: Harry Zhang 
Date:   2018-06-26T23:05:50Z

[HELIX-706] process tev and persist assignment asynchronously




---


[GitHub] helix pull request #204: [HELIX-705]: Participant duplicated state transitio...

2018-06-25 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/204

[HELIX-705]: Participant duplicated state transition handling rework

Re-implemented helix task executor state transition message dedup logic, 
and added tests for verifying it:

- Duplicated messages in the same batch: discard the later one
- Duplicated messages in different batches: the later one should be 
discarded if the first one is in progress
- During state transition, we should not rely on current state delta to get 
the partition's current state, but should lock on state model def (thread safety)
- Duplicated state transition (toState == currentState) should not result 
in an error, which is confusing, but should report success
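
The rules above can be sketched roughly as follows. This is an illustration under assumptions, not the HelixTaskExecutor code: `TransitionDedupSketch`, `Msg` and `Decision` are hypothetical, a "duplicate" is taken to mean a message for the same resource/partition, and a plain `synchronized` method stands in for locking on the state model.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical dedup sketch; none of these types are Helix classes. */
final class TransitionDedupSketch {
  /** Minimal stand-in for a state transition message. */
  static final class Msg {
    final String id;
    final String resource;
    final String partition;
    final String toState;

    Msg(String id, String resource, String partition, String toState) {
      this.id = id;
      this.resource = resource;
      this.partition = partition;
      this.toState = toState;
    }

    String key() {
      return resource + "/" + partition;
    }
  }

  enum Decision { EXECUTE, DISCARD_DUPLICATE, REPORT_SUCCESS }

  // Partitions whose transition was started by an earlier batch and has not finished.
  private final Set<String> inProgress = new HashSet<>();

  /** Decides, in arrival order, what to do with every message of one batch. */
  synchronized List<Decision> handleBatch(List<Msg> batch, Map<String, String> currentStates) {
    Set<String> seenInBatch = new HashSet<>();
    List<Decision> decisions = new ArrayList<>();
    for (Msg m : batch) {
      String key = m.key();
      if (!seenInBatch.add(key) || inProgress.contains(key)) {
        // Later duplicate in the same batch, or an earlier one is still running.
        decisions.add(Decision.DISCARD_DUPLICATE);
      } else if (m.toState.equals(currentStates.get(key))) {
        // Already in the target state: report success instead of an error.
        decisions.add(Decision.REPORT_SUCCESS);
      } else {
        inProgress.add(key);
        decisions.add(Decision.EXECUTE);
      }
    }
    return decisions;
  }

  /** Called once the transition for this message has completed. */
  synchronized void onTransitionFinished(Msg m) {
    inProgress.remove(m.key());
  }
}
```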

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/participant-st-dedup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/204.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #204


commit 04f1ba9701ccfb4c55d44ab4bc159577c3afd68b
Author: Harry Zhang 
Date:   2018-06-25T22:55:14Z

[HELIX-705]: Participant duplicated state transition handling rework




---


[GitHub] helix pull request #200: throw new Exception to avoid ugly NPE

2018-05-03 Thread zhan849
Github user zhan849 commented on a diff in the pull request:

https://github.com/apache/helix/pull/200#discussion_r185911985
  
--- Diff: 
helix-core/src/main/java/org/apache/helix/tools/commandtools/ZkGrep.java ---
@@ -463,9 +463,8 @@ static File gunzip(File zipFile) {
   return outputFile;
 } catch (IOException e) {
   LOG.error("fail to gunzip file: " + zipFile, e);
+  throw new Exception("fail to gunzip file" + zipFile);
--- End diff --

@lujiefsi might not be a good idea to only check lastZkSnapshot, because 
gunzip() is used in multiple places and you need to check all of them.


---


[GitHub] helix issue #201: add null check in DeplayedAutoRebalancerz#computeNewIdealS...

2018-05-03 Thread zhan849
Github user zhan849 commented on the issue:

https://github.com/apache/helix/pull/201
  
@lujiefsi currently we have the assumption that resource will have state 
model def, which is registered by ParticipantManager. If you really want to fix 
the issue, then I'd suggest doing the following:

- in computeNewIdealState, log an error when we cannot find the state model 
def, and mark all partitions as error; ResourceMonitor needs to be updated 
accordingly. Don't throw an exception as this will block rebalancing for all 
other valid resources

- WorkflowConfig's start time is fetched from its ScheduleConfig, which is 
enforced by the builder (if no start time is provided, the builder will fail 
the build). So we can assume it is always there. Similarly, if you really want 
to add a check (assuming someone did not use our API to create the object), 
don't throw an exception; log an error and record the failure in the workflow 
monitor
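
To illustrate the first suggestion, here is a rough sketch of "log and mark as error, but keep going" for a resource with no state model def. Everything in it is hypothetical: `MissingStateModelSketch` is not a Helix class, and the `initialStateByResource` map merely stands in for the state model def lookup.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

/** Hypothetical sketch: one bad resource must not block the others. */
final class MissingStateModelSketch {
  private static final Logger LOG = Logger.getLogger(MissingStateModelSketch.class.getName());

  /** Returns resource -> (partition -> state); partitions of unknown resources become ERROR. */
  static Map<String, Map<String, String>> computeStates(
      Map<String, List<String>> partitionsByResource,
      Map<String, String> initialStateByResource) {
    Map<String, Map<String, String>> result = new HashMap<>();
    for (Map.Entry<String, List<String>> e : partitionsByResource.entrySet()) {
      String resource = e.getKey();
      String state = initialStateByResource.get(resource);
      if (state == null) {
        // Log instead of throwing, so the loop continues with the remaining resources.
        LOG.severe("No state model def for resource " + resource + "; marking partitions ERROR");
        state = "ERROR";
      }
      Map<String, String> states = new HashMap<>();
      for (String partition : e.getValue()) {
        states.put(partition, state);
      }
      result.put(resource, states);
    }
    return result;
  }
}
```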


---


[GitHub] helix pull request #200: throw new Exception to avoid ugly NPE

2018-04-30 Thread zhan849
Github user zhan849 commented on a diff in the pull request:

https://github.com/apache/helix/pull/200#discussion_r185050120
  
--- Diff: 
helix-core/src/main/java/org/apache/helix/tools/commandtools/ZkGrep.java ---
@@ -463,9 +463,8 @@ static File gunzip(File zipFile) {
   return outputFile;
 } catch (IOException e) {
   LOG.error("fail to gunzip file: " + zipFile, e);
+  throw new Exception("fail to gunzip file" + zipFile);
--- End diff --

yes.

Tooling is fine here as the NPE is caught outside, and proper error messages 
are printed out.

BTW, wrapping IOException using generic Exception will erase the proper 
semantics that IOException carries, which is not a good practice.
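
As a sketch of the alternatives (not the actual ZkGrep code), the method can either declare `IOException` itself or, where a checked exception cannot be declared, wrap it in the JDK's `UncheckedIOException`, which keeps the original exception as the cause. `GunzipSketch` and the `.out` output naming are assumptions made for this example.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;
import java.util.zip.GZIPInputStream;

/** Hypothetical sketch of gunzip error handling that keeps IOException semantics. */
final class GunzipSketch {
  /** Decompresses zipFile into a sibling ".out" file; callers see the IOException as is. */
  static File gunzip(File zipFile) throws IOException {
    File out = new File(zipFile.getParentFile(), zipFile.getName() + ".out");
    try (InputStream in = new GZIPInputStream(Files.newInputStream(zipFile.toPath()))) {
      Files.copy(in, out.toPath(), StandardCopyOption.REPLACE_EXISTING);
    }
    return out;
  }

  /** Variant for call sites that cannot declare checked exceptions. */
  static File gunzipUnchecked(File zipFile) {
    try {
      return gunzip(zipFile);
    } catch (IOException e) {
      // The cause is preserved, so callers can still inspect the original IOException.
      throw new UncheckedIOException("fail to gunzip file: " + zipFile, e);
    }
  }
}
```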


---


[GitHub] helix issue #201: add null check in DeplayedAutoRebalancerz#computeNewIdealS...

2018-04-30 Thread zhan849
Github user zhan849 commented on the issue:

https://github.com/apache/helix/pull/201
  
1. IDEs are already doing NPE checking for us.
2. You are just detecting null and throwing another exception; how is that 
different from an NPE?


---


[GitHub] helix pull request #195: [HELIX-682] delete duplicated message and log error...

2018-04-24 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/195

[HELIX-682] delete duplicated message and log error in HelixTaskExecutor on 
participant

This PR is the second part of message dedup on participant side

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/participant-msg-dedup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/195.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #195


commit 8aba9bea0734da11722fbc8cceb74f34dd6a37c6
Author: Harry Zhang <zhan849@...>
Date:   2018-04-24T22:34:08Z

[HELIX-682] delete duplicated message and log error in HelixTaskExecutor on 
participant




---


[GitHub] helix pull request #194: Fix broken TestWorkflowTermination

2018-04-24 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/194

Fix broken TestWorkflowTermination

The test was broken by an earlier temp fix that stopped setting JobState to 
NOT_STARTED when initializing workflow context; this PR fixes the test to 
match that temp fix

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/test-fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/194.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #194


commit 1d4df9c0739b10be4802dbe36f9870f097a21e6f
Author: Harry Zhang <zhan849@...>
Date:   2018-04-24T19:47:28Z

Fix broken TestWorkflowTermination




---


[GitHub] helix pull request #184: fix broken TestTaskCreateThrottling

2018-04-19 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/184

fix broken TestTaskCreateThrottling

this PR fixes a broken test

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/test-fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/184.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #184


commit 0d7bbfc2d181231d354a37fa0c8bdcfa22f6a07d
Author: Harry Zhang <zhan849@...>
Date:   2018-04-19T18:42:01Z

fix broken TestTaskCreateThrottling




---


[GitHub] helix issue #180: Two minor fixes

2018-04-18 Thread zhan849
Github user zhan849 commented on the issue:

https://github.com/apache/helix/pull/180
  
@lei-xia done


---


[GitHub] helix pull request #182: [HELIX-695] add helix manager listener for new conn...

2018-04-16 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/182

[HELIX-695] add helix manager listener for new connection notification

In this PR I added invocation and related tests of 
`stateListener.onConnected()` method in ZkHelixManager when it is connected.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/helix-manager-onconnected

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/182.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #182


commit 65e84713503437c542e545abd521c2ba6d26
Author: Harry Zhang <zhan849@...>
Date:   2018-04-16T17:05:30Z

[HELIX-695] add helix manager listener for new connection notification




---


[GitHub] helix pull request #174: [HELIX-691] Allow users to update InstanceConfig

2018-04-09 Thread zhan849
Github user zhan849 commented on a diff in the pull request:

https://github.com/apache/helix/pull/174#discussion_r180175528
  
--- Diff: 
helix-rest/src/main/java/org/apache/helix/rest/server/resources/helix/InstanceAccessor.java
 ---
@@ -223,60 +224,60 @@ public Response 
updateInstance(@PathParam("clusterId") String clusterId,
   }
 
   switch (cmd) {
-  case enable:
-admin.enableInstance(clusterId, instanceName, true);
-break;
-  case disable:
-admin.enableInstance(clusterId, instanceName, false);
-break;
-  case reset:
-if (!validInstance(node, instanceName)) {
-  return badRequest("Instance names are not match!");
-}
-admin.resetPartition(clusterId, instanceName,
-node.get(InstanceProperties.resource.name()).toString(), 
(List) OBJECT_MAPPER
-
.readValue(node.get(InstanceProperties.partitions.name()).toString(),
-OBJECT_MAPPER.getTypeFactory()
-.constructCollectionType(List.class, 
String.class)));
-break;
-  case addInstanceTag:
-if (!validInstance(node, instanceName)) {
-  return badRequest("Instance names are not match!");
-}
-for (String tag : (List) OBJECT_MAPPER
-
.readValue(node.get(InstanceProperties.instanceTags.name()).toString(),
-
OBJECT_MAPPER.getTypeFactory().constructCollectionType(List.class, 
String.class))) {
-  admin.addInstanceTag(clusterId, instanceName, tag);
-}
-break;
-  case removeInstanceTag:
-if (!validInstance(node, instanceName)) {
-  return badRequest("Instance names are not match!");
-}
-for (String tag : (List) OBJECT_MAPPER
-
.readValue(node.get(InstanceProperties.instanceTags.name()).toString(),
-
OBJECT_MAPPER.getTypeFactory().constructCollectionType(List.class, 
String.class))) {
-  admin.removeInstanceTag(clusterId, instanceName, tag);
-}
-break;
-  case enablePartitions:
-admin.enablePartition(true, clusterId, instanceName,
-node.get(InstanceProperties.resource.name()).getTextValue(),
-(List) OBJECT_MAPPER
-
.readValue(node.get(InstanceProperties.partitions.name()).toString(),
-OBJECT_MAPPER.getTypeFactory()
-.constructCollectionType(List.class, 
String.class)));
-break;
-  case disablePartitions:
-admin.enablePartition(false, clusterId, instanceName,
-node.get(InstanceProperties.resource.name()).getTextValue(),
-(List) OBJECT_MAPPER
-
.readValue(node.get(InstanceProperties.partitions.name()).toString(),
-
OBJECT_MAPPER.getTypeFactory().constructCollectionType(List.class, 
String.class)));
-break;
-  default:
-_logger.error("Unsupported command :" + command);
-return badRequest("Unsupported command :" + command);
+case enable:
--- End diff --

Helix's formatter does not indent case, could you pls revert it back? Same 
for other places


---


[GitHub] helix pull request #174: [HELIX-691] Allow users to update InstanceConfig

2018-04-09 Thread zhan849
Github user zhan849 commented on a diff in the pull request:

https://github.com/apache/helix/pull/174#discussion_r180174778
  
--- Diff: 
helix-rest/src/main/java/org/apache/helix/rest/server/resources/helix/ResourceAccessor.java
 ---
@@ -84,10 +96,66 @@ public Response getResources(@PathParam("clusterId") 
String clusterId) {
 return JSONRepresentation(root);
   }
 
+  /**
--- End diff --

Partition health related changes are not part of this PR (allow user to 
change instance config); can we file a separate issue for them?


---


[GitHub] helix pull request #174: [HELIX-691] Allow users to update InstanceConfig

2018-04-09 Thread zhan849
Github user zhan849 commented on a diff in the pull request:

https://github.com/apache/helix/pull/174#discussion_r180178181
  
--- Diff: 
helix-rest/src/main/java/org/apache/helix/rest/server/resources/helix/InstanceAccessor.java
 ---
@@ -315,26 +316,27 @@ public Response 
getInstanceConfig(@PathParam("clusterId") String clusterId,
 return notFound();
   }
 
-  @PUT
+  @POST
--- End diff --

PUT is for "set" and POST is for "patch", I'd suggest we keep both. 
@dasahcc thoughts?
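
A small sketch of the distinction meant here, with a config modelled as a plain map. `ConfigUpdateSketch`, `set` and `patch` are hypothetical names, not the helix-rest API: "set" replaces the stored config with the request body, "patch" merges the body into it.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of "set" (PUT) versus "patch" (POST) semantics. */
final class ConfigUpdateSketch {
  /** "set": the stored config becomes exactly the request body. */
  static Map<String, String> set(Map<String, String> current, Map<String, String> body) {
    return new HashMap<>(body);
  }

  /** "patch": fields in the body overwrite or extend the stored config; others are kept. */
  static Map<String, String> patch(Map<String, String> current, Map<String, String> body) {
    Map<String, String> merged = new HashMap<>(current);
    merged.putAll(body);
    return merged;
  }
}
```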


---


[GitHub] helix pull request #175: [HELIX-692] use map instead of list to avoid deleti...

2018-04-06 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/175

[HELIX-692] use map instead of list to avoid deleting redundant message 
during cleanup

Currently in MessageGenerationPhase, we use a list to store messages to 
GC. However, pending messages are stored per resource/partition/instance, and 
under batch message mode the same message is stored once for each partition in 
the batch, so we end up cleaning up the same message many times.

This RB changes the list to a map to avoid redundant cleanup
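
The idea can be sketched as follows, assuming messages are keyed by message id. `MessageCleanupSketch` and `PendingMsg` are hypothetical stand-ins; the actual key and types used in MessageGenerationPhase may differ.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: collapse per-partition entries so each message is cleaned up once. */
final class MessageCleanupSketch {
  /** Minimal stand-in for a pending message seen under one partition. */
  static final class PendingMsg {
    final String msgId;
    final String partition;

    PendingMsg(String msgId, String partition) {
      this.msgId = msgId;
      this.partition = partition;
    }
  }

  /** Keeps the first entry per message id; later entries for the same id are ignored. */
  static Map<String, PendingMsg> collectForCleanup(List<PendingMsg> pendingPerPartition) {
    Map<String, PendingMsg> byId = new LinkedHashMap<>();
    for (PendingMsg m : pendingPerPartition) {
      byId.putIfAbsent(m.msgId, m);
    }
    return byId;
  }
}
```

With a batch message covering three partitions, the list version would schedule three deletions of the same message, while the map holds a single entry for it.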

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/HELIX-692

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/175.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #175


commit ce7d2e9d275e1403375edd94d63afa94ea1a2234
Author: Harry Zhang <zhan849@...>
Date:   2018-04-06T23:27:02Z

[HELIX-692] use map instead of list to avoid deleting redundant message 
during cleanup




---


[GitHub] helix pull request #173: [HELIX-689] remove redundant logs from zkclient

2018-04-03 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/173

[HELIX-689] remove redundant logs from zkclient

Currently, in controller message cleanup, we print out 2 lines of message 
when a message does not exist, which is totally redundant. In this PR, I 
removed the warning message from the controller, and added an error message in 
zkclient only when there is a real error (exception from below). If we fail to 
delete a ZNode because the znode does not exist, we no longer print a message 
except in debug mode

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/ctl-msg-cleanup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/173.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #173


commit d96a40caf19efffed3939b6dd8d9efe40734ec15
Author: Harry Zhang <zhan849@...>
Date:   2018-04-03T21:22:53Z

[HELIX-689] remove redundant logs from zkclient




---


[GitHub] helix pull request #162: [HELIX-683] clean monitoring cache upon helix contr...

2018-03-26 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/162

[HELIX-683] clean monitoring cache upon helix controller enable monitoring

In this PR I added methods to clear monitoring records in cache when we 
enable cluster status monitoring. I also added tests to reproduce the 
situation where a resource misses top state, the controller loses leadership, 
the resource regains top state, and the controller regains leadership, which 
causes a metrics reporting problem

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix 
harry/controller-monitor-cache-cleanup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/162.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #162


commit 373da77547fa1ea4a39c760e80da75e9d453d4f5
Author: Harry Zhang <zhan849@...>
Date:   2018-03-26T19:14:07Z

[HELIX-683] clean monitoring cache upon helix controller enable monitoring




---


[GitHub] helix pull request #156: [HELIX-682] controller should delete obsolete messa...

2018-03-22 Thread zhan849
Github user zhan849 commented on a diff in the pull request:

https://github.com/apache/helix/pull/156#discussion_r176498024
  
--- Diff: 
helix-core/src/main/java/org/apache/helix/controller/stages/MessageGenerationPhase.java
 ---
@@ -121,6 +131,18 @@ public void process(ClusterEvent event) throws 
Exception {
 
   Message message = null;
 
+  if (shouldCleanUpPendingMessage(pendingMessage, currentState,
+  currentStateOutput.getEndTime(resourceName, partition, 
instanceName))) {
+logger.info(
+"Adding pending message {} on instance {} to GC. Msg: 
{}->{}, current state of resource {}:{} is {}",
--- End diff --

changed it to "cleanup"


---


[GitHub] helix pull request #152: [HELIX-681] don't fail state transition task if we ...

2018-03-21 Thread zhan849
Github user zhan849 commented on a diff in the pull request:

https://github.com/apache/helix/pull/152#discussion_r176204863
  
--- Diff: helix-core/src/main/java/org/apache/helix/util/HelixUtil.java ---
@@ -219,4 +220,22 @@ public static String serializeByComma(List 
objects) {
 
 return idealStateMap;
   }
+
+  /**
+   * Remove the given message from ZK using the given accessor. This 
function will
+   * not throw exception
+   * @param accessor HelixDataAccessor
+   * @param msg message to remove
+   * @param instanceName name of the instance on which the message sits
+   * @return true if success else false
+   */
+  public static boolean removeMessageFromZK(HelixDataAccessor accessor, 
Message msg,
+  String instanceName) {
+try {
+  return accessor.removeProperty(msg.getKey(accessor.keyBuilder(), 
instanceName));
+} catch (Exception e) {
--- End diff --

It will not. The reason I did a general try-catch here is that I want to 
keep removeProperty semantics (only return true/false) here, but we do have 
leaked exceptions in underlying implementations of removeProperty()


---


[GitHub] helix pull request #152: [HELIX-681] don't fail state transition task if we ...

2018-03-21 Thread zhan849
Github user zhan849 commented on a diff in the pull request:

https://github.com/apache/helix/pull/152#discussion_r176182830
  
--- Diff: 
helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTask.java ---
@@ -168,7 +169,14 @@ public HelixTaskResult call() {
 
   // forward relay messages attached to this message to other 
participants
   if (taskResult.isSuccess()) {
-forwardRelayMessages(accessor, _message, 
taskResult.getCompleteTime());
+try {
+  forwardRelayMessages(accessor, _message, 
taskResult.getCompleteTime());
+} catch (Exception e) {
+  // Fail to send relay message should not result in a task 
execution failure
+  // Currently we don't log error to ZK to reduce writes as when 
accessor throws
+  // exception, ZK might not be in good condition.
+  logger.error("Failed to send relay messages.", e);
--- End diff --

will change


---


[GitHub] helix pull request #156: [HELIX-682] controller should delete obsolete messa...

2018-03-20 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/156

[HELIX-682] controller should delete obsolete messages with timeout to 
unblock state transition

This RB contains implementations and tests for controller: during 
MessageGenerationPhase, it checks if the pending message should be cleaned up 
on participant to unblock further state transition:

- If the partition's current state is the same as the message's toState and 
the 3sec timeout has already passed, it's likely that the participant failed 
to delete the message, and the controller should proactively remove it so 
further rebalance can be unblocked
- If the partition's current state is the same as the message's fromState, 
the partition is undergoing state transition or the state transition has not 
started yet; in this case, we do nothing
- If the partition's current state is neither the message's fromState nor 
toState (almost impossible), the message is a problematic one, and it is safe 
to delete it immediately so the participant does not undergo unnecessary 
message handling

Message deletion on controller side is async
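
The three cases above can be sketched as a single predicate. This is only an illustration: `PendingMessageCleanupSketch`, its parameters and the `>=` comparison are assumptions, and the real `shouldCleanUpPendingMessage` in MessageGenerationPhase takes different arguments.

```java
/** Hypothetical sketch of the controller-side cleanup decision. */
final class PendingMessageCleanupSketch {
  // The "3sec timeout" from the description above.
  static final long CLEANUP_TIMEOUT_MS = 3000L;

  /**
   * @param transitionEndTimeMs when the participant reported reaching its current state
   * @param nowMs current time, passed in so the decision is easy to test
   */
  static boolean shouldCleanUp(String fromState, String toState, String currentState,
      long transitionEndTimeMs, long nowMs) {
    if (currentState.equals(toState)) {
      // Transition finished; clean up only if the participant had enough time to do it itself.
      return nowMs - transitionEndTimeMs >= CLEANUP_TIMEOUT_MS;
    }
    if (currentState.equals(fromState)) {
      // Transition is running or has not started yet: leave the message alone.
      return false;
    }
    // Current state matches neither end of the transition: the message is stale.
    return true;
  }
}
```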

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/controller-msg-dedup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/156.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #156


commit 9f789dee0b17886bd97ebf4cc14e9d867043183d
Author: Harry Zhang <zhan849@...>
Date:   2018-03-21T01:47:02Z

[HELIX-682] controller should delete obsolete messages with timeout to 
unblock state transition




---


[GitHub] helix pull request #152: [HELIX-681] don't fail state transition task if we ...

2018-03-19 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/152

[HELIX-681] don't fail state transition task if we fail to remove message 
or send out relay message

This PR includes fixes on the participant side:
1. Consolidated message deletion logic to HelixUtil, as we currently have 
duplicated logic in various places
2. When we fail to delete a message, we don't throw an exception to fail the 
task
3. When we fail to send out a relay message, we don't throw an exception to 
fail the task

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/HELIX-681

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/152.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #152






---


[GitHub] helix pull request #148: Move RoutingDataCache to BasicDataCache as a sharab...

2018-03-14 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/148

Move RoutingDataCache to BasicDataCache as a sharable component

In this commit, I moved the main logic of RoutingDataCache to 
BasicClusterDataCache under helix.common, to make it a commonly shareable 
component.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/cache-refactor

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/148.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #148


commit baf8b830e787ba1b31d4f0ad2c07f3eb33a9208f
Author: Harry Zhang <zhan849@...>
Date:   2018-03-14T18:58:45Z

Move RoutingDataCache to BasicDataCache as a sharable component




---


[GitHub] helix issue #146: [HELIX-680] add system setting to unblock TestZkCallbackHa...

2018-03-14 Thread zhan849
Github user zhan849 commented on the issue:

https://github.com/apache/helix/pull/146
  
@lei-xia just rebased


---


[GitHub] helix pull request #146: [HELIX-680] add system setting to unblock TestZkCal...

2018-03-09 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/146

[HELIX-680] add system setting to unblock TestZkCallbackHandlerLeak test 
with zookeeper 3.4.11 upgrade

With a system property added in ZkUnitTestBase `beforeSuite()`, 
`TestZkCallbackHandlerLeak` now passes

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/zk-test-fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/146.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #146


commit e4de3c2247042754ed193789b8a8671012576e7d
Author: hrzhang <hrzhang@...>
Date:   2018-03-09T20:20:16Z

[HELIX-680] add system setting to unblock TestZkCallbackHandlerLeak test 
with zookeeper 3.4.11 upgrade




---


[GitHub] helix pull request #140: [HELIX-679] consolidate semantics of recursively de...

2018-03-08 Thread zhan849
GitHub user zhan849 opened a pull request:

https://github.com/apache/helix/pull/140

[HELIX-679] consolidate semantics of recursively delete path in ZkClient

This change consolidates the semantics of APIs in ZkClient that recursively 
delete a path

* For backward compatibility, we keep `deleteRecursive()`, which only 
returns true/false and does not throw exceptions.
* Create a new method called `deleteRecursively()` that only throws an 
exception upon error.
* Mark `deleteRecursive()` as deprecated, since throwing an exception can 
carry error information
* Migrate all current usages of `deleteRecursive()` to `deleteRecursively()`
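
A minimal sketch of the two semantics side by side, using an in-memory tree instead of ZooKeeper. `RecursiveDeleteSketch` is hypothetical, and having the deprecated method delegate to the new one is an assumption for illustration; the ZkClient implementation may differ.

```java
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for a ZooKeeper path tree. */
final class RecursiveDeleteSketch {
  private final TreeMap<String, String> nodes = new TreeMap<>();

  void create(String path) {
    nodes.put(path, "");
  }

  boolean exists(String path) {
    return nodes.containsKey(path);
  }

  /** New-style API: throws on error, so the caller learns why deletion failed. */
  void deleteRecursively(String path) {
    if (path == null || !path.startsWith("/")) {
      throw new IllegalArgumentException("invalid path: " + path);
    }
    nodes.remove(path);
    // Remove descendants only; "/ab" is not a child of "/a".
    nodes.keySet().removeIf(p -> p.startsWith(path + "/"));
  }

  /** Old-style API kept for backward compatibility: never throws, only true/false. */
  @Deprecated
  boolean deleteRecursive(String path) {
    try {
      deleteRecursively(path);
      return true;
    } catch (RuntimeException e) {
      return false;
    }
  }
}
```

The boolean variant hides the reason for a failure, which is the motivation given above for deprecating it.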

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhan849/helix harry/zk-client-fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/helix/pull/140.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #140


commit 8412d3f7e7d8097eee820c4d055b1526ac74aca1
Author: hrzhang <hrzhang@...>
Date:   2018-03-08T22:04:42Z

[HELIX-679] consolidate semantics of recursively delete path in ZkClient




---