[jira] [Created] (HELIX-551) External view partition states go out of sync

2014-11-18 Thread Varun Sharma (JIRA)
Varun Sharma created HELIX-551:
--

 Summary: External view  partition states go out of sync
 Key: HELIX-551
 URL: https://issues.apache.org/jira/browse/HELIX-551
 Project: Apache Helix
  Issue Type: Bug
Affects Versions: 0.6.4
Reporter: Varun Sharma


Hi,

I am seeing the following issue for many partitions in Helix using a simple 
Online-Offline state model factory. The external view says that the partition 
has been assigned to 3 hosts. However, when I look at the hosts, only 1 of them 
executed the OFFLINE --> ONLINE transition.

On the hosts that did not execute the transition, I see the following:

2014-11-13 09:29:54,394 [pool-3-thread-11] 
(HelixStateTransitionHandler.java:206) WARN  Force CurrentState on Zk to be 
stateModel's CurrentState. partitionKey: 490, currentState: ONLINE, message: 
12690ce8-8098-46b1-a93d-279604f0e3db, {CREATE_TIMESTAMP=1415870993349, 
ClusterEventName=idealStateChange, EXECUTE_START_TIMESTAMP=1415870994382, 
EXE_SESSION_ID=149a14ada0d0013, FROM_STATE=OFFLINE, 
MSG_ID=12690ce8-8098-46b1-a93d-279604f0e3db, MSG_STATE=read, 
MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490, READ_TIMESTAMP=1415870993787, 
RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201, 
SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7, 
STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT, 
TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013, 
TO_STATE=ONLINE}{}{} 

When I grep the message ID in the controller, I see the following:

2014-11-14 09:34:56,265 [StatusDumpTimerTask] (ZKPathDataDumpTask.java:155) 
INFO  {
  id : 149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201,
  mapFields : {
    HELIX_ERROR 20141113-092954.000419 STATE_TRANSITION c1193025-b416-49d7-adc2-10afe2389141 : {
      AdditionalInfo : Message execution failed. msgId: 12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg: org.apache.helix.messaging.handling.HelixStateTransitionHandler$HelixStateMismatchException: Current state of stateModel does not match the fromState in Message, Current State:ONLINE, message expected:OFFLINE, partition: 490, from: hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256,
      Class : class org.apache.helix.messaging.handling.HelixStateTransitionHandler,
      MSG_ID : 12690ce8-8098-46b1-a93d-279604f0e3db,
      Message state : READ
    },



What could be causing this state mismatch? When I restart the node, the error 
disappears (meaning that the node is then able to perform the state 
transition).



Thanks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HELIX-552) StateModelFactory#_stateModelMap should use both resourceName and partitionKey to map a state model

2014-11-18 Thread Zhen Zhang (JIRA)
Zhen Zhang created HELIX-552:


 Summary: StateModelFactory#_stateModelMap should use both 
resourceName and partitionKey to map a state model
 Key: HELIX-552
 URL: https://issues.apache.org/jira/browse/HELIX-552
 Project: Apache Helix
  Issue Type: Bug
Reporter: Zhen Zhang
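
A minimal sketch of the idea in the summary, assuming nothing about the actual 
StateModelFactory internals: if state models are looked up by partition key 
alone, partition "490" of one resource and partition "490" of another resource 
share a single entry, so one resource can observe the other's state. Keying by 
both resource name and partition key keeps them apart. All class names below 
(SketchStateModel, ResourcePartitionKey, SketchStateModelRegistry) are 
hypothetical and are not Helix classes.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a per-partition state model; not a Helix class.
final class SketchStateModel {
  volatile String currentState = "OFFLINE";
}

// Composite key: the same partition key under two resources yields two distinct keys.
final class ResourcePartitionKey {
  final String resourceName;
  final String partitionKey;

  ResourcePartitionKey(String resourceName, String partitionKey) {
    this.resourceName = Objects.requireNonNull(resourceName);
    this.partitionKey = Objects.requireNonNull(partitionKey);
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof ResourcePartitionKey)) {
      return false;
    }
    ResourcePartitionKey other = (ResourcePartitionKey) o;
    return resourceName.equals(other.resourceName)
        && partitionKey.equals(other.partitionKey);
  }

  @Override
  public int hashCode() {
    return Objects.hash(resourceName, partitionKey);
  }
}

// Hypothetical registry playing the role of a state model map.
final class SketchStateModelRegistry {
  private final Map<ResourcePartitionKey, SketchStateModel> models =
      new ConcurrentHashMap<>();

  // Returns the existing model for (resource, partition) or creates a new OFFLINE one.
  SketchStateModel getOrCreate(String resourceName, String partitionKey) {
    return models.computeIfAbsent(
        new ResourcePartitionKey(resourceName, partitionKey),
        k -> new SketchStateModel());
  }
}
```

With this registry, marking ("resourceA", "490") ONLINE leaves 
("resourceB", "490") at OFFLINE, whereas a map keyed only by "490" would 
return the same object for both lookups.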








[jira] [Commented] (HELIX-551) External view partition states go out of sync

2014-11-18 Thread Zhen Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217232#comment-14217232
 ] 

Zhen Zhang commented on HELIX-551:
--

this is related to https://issues.apache.org/jira/browse/HELIX-552






[jira] [Created] (HELIX-553) While enqueueing a job, task framework should create state model Task if it doesn't already exist

2014-11-18 Thread Karthiek (JIRA)
Karthiek created HELIX-553:
--

 Summary: While enqueueing a job, task framework should create 
state model Task if it doesn't already exist
 Key: HELIX-553
 URL: https://issues.apache.org/jira/browse/HELIX-553
 Project: Apache Helix
  Issue Type: Bug
Reporter: Karthiek


Task framework expects the Task state model to be already defined. Otherwise 
enqueueing a job using ClusterTask framework throws this exception:

org.apache.helix.HelixException: State model Task not found in the cluster STATEMODELDEFS path
    at org.apache.helix.manager.zk.ZKHelixAdmin.addResource(ZKHelixAdmin.java:608)
    at org.apache.helix.manager.zk.ZKHelixAdmin.addResource(ZKHelixAdmin.java:651)
    at org.apache.helix.manager.zk.ZKHelixAdmin.addResource(ZKHelixAdmin.java:625)
    at org.apache.helix.manager.zk.ZKHelixAdmin.addResource(ZKHelixAdmin.java:592)
    at org.apache.helix.task.TaskDriver.scheduleJob(TaskDriver.java:327)
    at org.apache.helix.task.TaskDriver.enqueueJob(TaskDriver.java:316)
    at com.linkedin.espresso.bulkmigrator.BulkOperationScheduler.schedule(BulkOperationScheduler.java:98)
    at com.linkedin.espresso.test.bulkoperation.ScheduleEIJob.main(ScheduleEIJob.java:38)

Existing clusters will not have the Task state model defined. It would be 
really great if the task framework created it automatically when the state 
model doesn't exist. 
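
The requested behavior can be sketched as a create-if-absent step that runs 
before the job is added. This is only an illustration under stated 
assumptions: SketchTaskDriver and its in-memory map are hypothetical stand-ins 
for the task driver and the cluster's STATEMODELDEFS path, and the placeholder 
definition string is not a real state model definition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a task driver; not the Helix TaskDriver.
final class SketchTaskDriver {
  static final String TASK_STATE_MODEL = "Task";

  // Hypothetical in-memory view of STATEMODELDEFS: state model name -> definition.
  private final Map<String, String> stateModelDefs;
  private final List<String> queuedJobs = new ArrayList<>();

  SketchTaskDriver(Map<String, String> stateModelDefs) {
    this.stateModelDefs = stateModelDefs;
  }

  void enqueueJob(String jobName) {
    // Create the Task state model only if it is missing; an existing
    // definition is left untouched.
    stateModelDefs.putIfAbsent(TASK_STATE_MODEL, "placeholder Task definition");
    queuedJobs.add(jobName);
  }

  List<String> queuedJobs() {
    return queuedJobs;
  }
}
```

On a cluster that already defines the state model the extra step is a no-op, 
so existing and new clusters take the same code path.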





[jira] [Commented] (HELIX-550) ZKHelixManager does not shutdown GenericHelixController threads.

2014-11-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217391#comment-14217391
 ] 

ASF GitHub Bot commented on HELIX-550:
--

Github user kanakb commented on a diff in the pull request:

https://github.com/apache/helix/pull/11#discussion_r20557176
  
--- Diff: helix-core/src/main/java/org/apache/helix/controller/GenericHelixController.java ---
@@ -573,6 +573,15 @@ public void shutdownClusterStatusMonitor(String clusterName) {
     }
   }
 
+  public void shutdown() throws InterruptedException {
+    stopRebalancingTimer();
+    while (_eventThread.isAlive())
+    {
+      _eventThread.interrupt();
+      _eventThread.join(1000);
--- End diff --

Can you change this to a constant variable?


 ZKHelixManager does not shutdown GenericHelixController threads.
 

 Key: HELIX-550
 URL: https://issues.apache.org/jira/browse/HELIX-550
 Project: Apache Helix
  Issue Type: Bug
Reporter: Antony T Curtis
Priority: Critical

 ZKHelixManager does not shutdown GenericHelixController threads.





[GitHub] helix pull request: [HELIX-550] ZKHelixManager should shutdown Gen...

2014-11-18 Thread atcurtis
Github user atcurtis commented on a diff in the pull request:

https://github.com/apache/helix/pull/11#discussion_r20557205
  
--- Diff: helix-core/src/main/java/org/apache/helix/controller/GenericHelixController.java ---
@@ -573,6 +573,15 @@ public void shutdownClusterStatusMonitor(String clusterName) {
     }
   }
 
+  public void shutdown() throws InterruptedException {
+    stopRebalancingTimer();
+    while (_eventThread.isAlive())
+    {
+      _eventThread.interrupt();
+      _eventThread.join(1000);
--- End diff --

Sure. Any preference for the constant name?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (HELIX-550) ZKHelixManager does not shutdown GenericHelixController threads.

2014-11-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217396#comment-14217396
 ] 

ASF GitHub Bot commented on HELIX-550:
--

Github user kanakb commented on a diff in the pull request:

https://github.com/apache/helix/pull/11#discussion_r20557259
  
--- Diff: helix-core/src/main/java/org/apache/helix/controller/GenericHelixController.java ---
@@ -573,6 +573,15 @@ public void shutdownClusterStatusMonitor(String clusterName) {
     }
   }
 
+  public void shutdown() throws InterruptedException {
+    stopRebalancingTimer();
+    while (_eventThread.isAlive())
+    {
+      _eventThread.interrupt();
+      _eventThread.join(1000);
--- End diff --

Maybe something like `EVENT_THREAD_JOIN_TIMEOUT`?
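
For illustration, the loop from the diff with the literal replaced by the 
suggested constant might look like the sketch below. SketchEventLoop and its 
sleeping worker are hypothetical stand-ins for the controller and its event 
thread, and treating the value as milliseconds is an assumption carried over 
from the original join(1000) call.

```java
// Hypothetical stand-in for a controller that owns an event thread.
final class SketchEventLoop {
  // Name suggested in the review; milliseconds, matching Thread.join(long).
  static final long EVENT_THREAD_JOIN_TIMEOUT = 1000L;

  private final Thread eventThread = new Thread(() -> {
    try {
      while (true) {
        Thread.sleep(50); // stand-in for blocking on an event queue
      }
    } catch (InterruptedException e) {
      // Interrupted by shutdown(): fall through so the thread exits.
    }
  }, "sketch-event-thread");

  void start() {
    eventThread.start();
  }

  // Keeps interrupting and joining until the event thread has exited.
  void shutdown() throws InterruptedException {
    while (eventThread.isAlive()) {
      eventThread.interrupt();
      eventThread.join(EVENT_THREAD_JOIN_TIMEOUT);
    }
  }

  boolean isRunning() {
    return eventThread.isAlive();
  }
}
```

The named constant documents what the number means and gives a single place 
to tune the timeout.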







[GitHub] helix pull request: [HELIX-550] ZKHelixManager should shutdown Gen...

2014-11-18 Thread kanakb
Github user kanakb commented on a diff in the pull request:

https://github.com/apache/helix/pull/11#discussion_r20557289
  
--- Diff: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixManager.java ---
@@ -543,6 +554,19 @@ public void disconnect() {
       _zkclient.close();
       _zkclient = null;
       LOG.info("Cluster manager: " + _instanceName + " disconnected");
+
+      if (_controller != null) {
+        try {
+          _controller.shutdown();
+        }
--- End diff --

nit: can you make the `catch` start on the same line as the close brace of 
the `try`?
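
The requested brace style, sketched on a self-contained example. 
SketchController and SketchManager are hypothetical stand-ins, and the body of 
the catch block (restoring the interrupt flag) is an assumption, since the 
diff excerpt ends before the catch.

```java
// Hypothetical stand-in for the controller being shut down.
interface SketchController {
  void shutdown() throws InterruptedException;
}

// Hypothetical stand-in for the manager's disconnect path.
final class SketchManager {
  private final SketchController controller; // may be null

  SketchManager(SketchController controller) {
    this.controller = controller;
  }

  void disconnect() {
    if (controller != null) {
      try {
        controller.shutdown();
      } catch (InterruptedException e) { // catch on the same line as the closing brace
        // Assumption: preserve the interrupt for callers instead of swallowing it.
        Thread.currentThread().interrupt();
      }
    }
  }
}
```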




[GitHub] helix pull request: [HELIX-550] ZKHelixManager should shutdown Gen...

2014-11-18 Thread kanakb
Github user kanakb commented on the pull request:

https://github.com/apache/helix/pull/11#issuecomment-63591519
  
Merged, thanks!




helix - Build # 1304 - Still Unstable

2014-11-18 Thread Apache Jenkins Server
The Apache Jenkins build system has built helix (build #1304)

Status: Still Unstable

Check console output at https://builds.apache.org/job/helix/1304/ to view the 
results.

Review Request 28215: [HELIX-550] Shutdown GenericHelixController on disconnect (port to master)

2014-11-18 Thread Kanak Biscuitwala

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28215/
---

Review request for helix, Zhen Zhang and Kishore Gopalakrishna.


Bugs: HELIX-550


Repository: helix-git


Description
---

This is a port to master of this PR: 
https://github.com/apache/helix/pull/11/files#diff-866b65f4aa4b4753224ff615eb2efc1eR533

commit bfb4a3d34228f5c3806b1eee9e98f401386e66a9
Author: Kanak Biscuitwala kana...@hotmail.com
Date:   Tue Nov 18 21:23:29 2014 -0800

[HELIX-550] Shutdown GenericHelixController on disconnect

:100644 100644 aef636e... 113cace... M  helix-core/src/main/java/org/apache/helix/controller/GenericHelixController.java
:100644 100644 295b69c... fafe604... M  helix-core/src/main/java/org/apache/helix/manager/zk/ZkHelixController.java


Diffs
-

  helix-core/src/main/java/org/apache/helix/controller/GenericHelixController.java 6fa3d05 
  helix-core/src/main/java/org/apache/helix/manager/zk/ZkHelixController.java 295b69c 

Diff: https://reviews.apache.org/r/28215/diff/


Testing
---


Thanks,

Kanak Biscuitwala