[jira] [Created] (HELIX-551) External view partition states go out of sync
Varun Sharma created HELIX-551:

Summary: External view partition states go out of sync
Key: HELIX-551
URL: https://issues.apache.org/jira/browse/HELIX-551
Project: Apache Helix
Issue Type: Bug
Affects Versions: 0.6.4
Reporter: Varun Sharma

Hi,

I am seeing the following issue for many partitions in Helix using a simple Online-Offline state model factory. The external view says that the partition has been assigned to 3 hosts. However, when I look at the hosts, only 1 of them executed the OFFLINE -> ONLINE transition. On the hosts that did not execute the transition, I see the following:

2014-11-13 09:29:54,394 [pool-3-thread-11] (HelixStateTransitionHandler.java:206) WARN Force CurrentState on Zk to be stateModel's CurrentState. partitionKey: 490, currentState: ONLINE, message: 12690ce8-8098-46b1-a93d-279604f0e3db, {CREATE_TIMESTAMP=1415870993349, ClusterEventName=idealStateChange, EXECUTE_START_TIMESTAMP=1415870994382, EXE_SESSION_ID=149a14ada0d0013, FROM_STATE=OFFLINE, MSG_ID=12690ce8-8098-46b1-a93d-279604f0e3db, MSG_STATE=read, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=490, READ_TIMESTAMP=1415870993787, RESOURCE_NAME=$terrapin$data$meta_pin_join$1415866960201, SRC_NAME=hdfsterrapin-a-namenode001_9090, SRC_SESSION_ID=147a7beb2dd8ed7, STATE_MODEL_DEF=OnlineOffline, STATE_MODEL_FACTORY_NAME=DEFAULT, TGT_NAME=hdfsterrapin-a-datanode-ba3ad256, TGT_SESSION_ID=149a14ada0d0013, TO_STATE=ONLINE}{}{}

When I grep for the message ID on the controller, I see the following:

2014-11-14 09:34:56,265 [StatusDumpTimerTask] (ZKPathDataDumpTask.java:155) INFO { id : 149a14ada0d0013__$terrapin$data$meta_pin_join$1415866960201, mapFields : { HELIX_ERROR 20141113-092954.000419 STATE_TRANSITION c1193025-b416-49d7-adc2-10afe2389141 : { AdditionalInfo : Message execution failed. msgId: 12690ce8-8098-46b1-a93d-279604f0e3db, errorMsg: org.apache.helix.messaging.handling.HelixStateTransitionHandler$HelixStateMismatchException: Current state of stateModel does not match the fromState in Message, Current State:ONLINE, message expected:OFFLINE, partition: 490, from: hdfsterrapin-a-namenode001_9090, to: hdfsterrapin-a-datanode-ba3ad256, Class : class org.apache.helix.messaging.handling.HelixStateTransitionHandler, MSG_ID : 12690ce8-8098-46b1-a93d-279604f0e3db, Message state : READ },

What could be causing this state mismatch? When I restart the node, the error disappears (meaning that the node is then able to perform the state transition).

Thanks

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HELIX-552) StateModelFactory#_stateModelMap should use both resourceName and partitionKey to map a state model
Zhen Zhang created HELIX-552:

Summary: StateModelFactory#_stateModelMap should use both resourceName and partitionKey to map a state model
Key: HELIX-552
URL: https://issues.apache.org/jira/browse/HELIX-552
Project: Apache Helix
Issue Type: Bug
Reporter: Zhen Zhang
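The summary of HELIX-552 can be illustrated with a small sketch. The log in HELIX-551 shows a bare partition key (`490`), so one plausible reading (an assumption, not confirmed anywhere in this thread) is that a map keyed by partition alone hands back a state model that another resource's partition of the same name already moved to ONLINE. The classes `SketchStateModel` and `SketchStateModelRegistry` below are hypothetical stand-ins, not Helix classes; they only show the two-level keying the summary asks for.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a per-partition state model; not the Helix class. */
final class SketchStateModel {
  private volatile String currentState = "OFFLINE";

  String getCurrentState() {
    return currentState;
  }

  void setCurrentState(String state) {
    currentState = state;
  }
}

/** Hypothetical registry keyed by resource name first, then by partition key. */
final class SketchStateModelRegistry {
  private final Map<String, Map<String, SketchStateModel>> models =
      new ConcurrentHashMap<>();

  /** Returns the model for (resourceName, partitionKey), creating it on first use. */
  SketchStateModel getOrCreate(String resourceName, String partitionKey) {
    return models
        .computeIfAbsent(resourceName, r -> new ConcurrentHashMap<>())
        .computeIfAbsent(partitionKey, p -> new SketchStateModel());
  }
}
```

With this keying, partition `490` of one resource and partition `490` of another get separate models, so a transition on the first leaves the second at its initial state.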
[jira] [Commented] (HELIX-551) External view partition states go out of sync
[ https://issues.apache.org/jira/browse/HELIX-551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217232#comment-14217232 ]

Zhen Zhang commented on HELIX-551:

This is related to https://issues.apache.org/jira/browse/HELIX-552
[jira] [Created] (HELIX-553) While enqueueing a job, task framework should create state model Task if it doesn't already exist
Karthiek created HELIX-553:

Summary: While enqueueing a job, task framework should create state model Task if it doesn't already exist
Key: HELIX-553
URL: https://issues.apache.org/jira/browse/HELIX-553
Project: Apache Helix
Issue Type: Bug
Reporter: Karthiek

The task framework expects the Task state model to be already defined. Otherwise, enqueueing a job using the ClusterTask framework throws this exception:

org.apache.helix.HelixException: State model Task not found in the cluster STATEMODELDEFS path
  at org.apache.helix.manager.zk.ZKHelixAdmin.addResource(ZKHelixAdmin.java:608)
  at org.apache.helix.manager.zk.ZKHelixAdmin.addResource(ZKHelixAdmin.java:651)
  at org.apache.helix.manager.zk.ZKHelixAdmin.addResource(ZKHelixAdmin.java:625)
  at org.apache.helix.manager.zk.ZKHelixAdmin.addResource(ZKHelixAdmin.java:592)
  at org.apache.helix.task.TaskDriver.scheduleJob(TaskDriver.java:327)
  at org.apache.helix.task.TaskDriver.enqueueJob(TaskDriver.java:316)
  at com.linkedin.espresso.bulkmigrator.BulkOperationScheduler.schedule(BulkOperationScheduler.java:98)
  at com.linkedin.espresso.test.bulkoperation.ScheduleEIJob.main(ScheduleEIJob.java:38)

Existing clusters will not have the Task state model defined. It would be really great if the task framework automatically created the state model when it doesn't exist.
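The behaviour requested in HELIX-553 amounts to a "define if absent" step before the resource is added. The sketch below shows that idea only. `SketchStateModelDefs` and `SketchJobScheduler` are hypothetical in-memory stand-ins, not the Helix admin or TaskDriver API, and the definition string is a placeholder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the cluster's STATEMODELDEFS store; not a Helix API. */
final class SketchStateModelDefs {
  private final Map<String, String> defs = new ConcurrentHashMap<>();

  /** Registers the definition only if the name is absent; true if it was added. */
  boolean ensureDefined(String name, String definition) {
    return defs.putIfAbsent(name, definition) == null;
  }

  boolean isDefined(String name) {
    return defs.containsKey(name);
  }
}

/** Hypothetical scheduler that creates the Task state model before enqueueing. */
final class SketchJobScheduler {
  static final String TASK_STATE_MODEL = "Task";

  private final SketchStateModelDefs defs;
  private final List<String> jobs = new ArrayList<>();

  SketchJobScheduler(SketchStateModelDefs defs) {
    this.defs = defs;
  }

  void enqueueJob(String jobName) {
    // Create the state model on demand instead of failing on older clusters.
    defs.ensureDefined(TASK_STATE_MODEL, "placeholder Task state model definition");
    jobs.add(jobName);
  }

  List<String> jobs() {
    return jobs;
  }
}
```

Using an atomic put-if-absent keeps the step idempotent, so repeated enqueues on an existing cluster do not overwrite a definition that is already there.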
[jira] [Commented] (HELIX-550) ZKHelixManager does not shutdown GenericHelixController threads.
[ https://issues.apache.org/jira/browse/HELIX-550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217391#comment-14217391 ]

ASF GitHub Bot commented on HELIX-550:

Github user kanakb commented on a diff in the pull request:

https://github.com/apache/helix/pull/11#discussion_r20557176

--- Diff: helix-core/src/main/java/org/apache/helix/controller/GenericHelixController.java ---
@@ -573,6 +573,15 @@ public void shutdownClusterStatusMonitor(String clusterName) {
     }
   }

+  public void shutdown() throws InterruptedException {
+    stopRebalancingTimer();
+    while (_eventThread.isAlive())
+    {
+      _eventThread.interrupt();
+      _eventThread.join(1000);
--- End diff --

Can you change this to a constant variable?

ZKHelixManager does not shutdown GenericHelixController threads.

Key: HELIX-550
URL: https://issues.apache.org/jira/browse/HELIX-550
Project: Apache Helix
Issue Type: Bug
Reporter: Antony T Curtis
Priority: Critical

ZKHelixManager does not shutdown GenericHelixController threads.
[GitHub] helix pull request: [HELIX-550] ZKHelixManager should shutdown Gen...
Github user atcurtis commented on a diff in the pull request:

https://github.com/apache/helix/pull/11#discussion_r20557205

--- Diff: helix-core/src/main/java/org/apache/helix/controller/GenericHelixController.java ---
@@ -573,6 +573,15 @@ public void shutdownClusterStatusMonitor(String clusterName) {
     }
   }

+  public void shutdown() throws InterruptedException {
+    stopRebalancingTimer();
+    while (_eventThread.isAlive())
+    {
+      _eventThread.interrupt();
+      _eventThread.join(1000);
--- End diff --

Sure. Any preference for the constant name?

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
[jira] [Commented] (HELIX-550) ZKHelixManager does not shutdown GenericHelixController threads.
[ https://issues.apache.org/jira/browse/HELIX-550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217396#comment-14217396 ]

ASF GitHub Bot commented on HELIX-550:

Github user kanakb commented on a diff in the pull request:

https://github.com/apache/helix/pull/11#discussion_r20557259

--- Diff: helix-core/src/main/java/org/apache/helix/controller/GenericHelixController.java ---
@@ -573,6 +573,15 @@ public void shutdownClusterStatusMonitor(String clusterName) {
     }
   }

+  public void shutdown() throws InterruptedException {
+    stopRebalancingTimer();
+    while (_eventThread.isAlive())
+    {
+      _eventThread.interrupt();
+      _eventThread.join(1000);
--- End diff --

Maybe something like `EVENT_THREAD_JOIN_TIMEOUT`?
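The review suggestion can be sketched as follows. This is a minimal, self-contained illustration of the interrupt-and-join loop from the quoted hunk with the literal `1000` lifted into the constant name proposed in the review. `SketchEventLoop` and its worker thread are hypothetical; this is not the code that was merged.

```java
/** Hypothetical stand-in for a controller that owns one event thread. */
final class SketchEventLoop {
  /** Constant name proposed in the review; the 1000 ms value is the literal from the diff. */
  static final long EVENT_THREAD_JOIN_TIMEOUT = 1000L;

  private final Thread eventThread = new Thread(() -> {
    try {
      while (true) {
        Thread.sleep(50); // placeholder for waiting on and handling an event
      }
    } catch (InterruptedException e) {
      // Interrupted by shutdown(): fall through and let the thread exit.
    }
  }, "sketch-event-thread");

  void start() {
    eventThread.start();
  }

  /** Interrupts the event thread repeatedly until it has actually exited. */
  void shutdown() throws InterruptedException {
    while (eventThread.isAlive()) {
      eventThread.interrupt();
      eventThread.join(EVENT_THREAD_JOIN_TIMEOUT);
    }
  }

  boolean isRunning() {
    return eventThread.isAlive();
  }
}
```

The loop re-interrupts after each timed join, so a worker that swallows one interrupt still gets another chance to exit instead of blocking shutdown forever on a single `join()`.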
[GitHub] helix pull request: [HELIX-550] ZKHelixManager should shutdown Gen...
Github user kanakb commented on a diff in the pull request:

https://github.com/apache/helix/pull/11#discussion_r20557289

--- Diff: helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixManager.java ---
@@ -543,6 +554,19 @@ public void disconnect() {
       _zkclient.close();
       _zkclient = null;
       LOG.info("Cluster manager: " + _instanceName + " disconnected");
+
+      if (_controller != null) {
+        try {
+          _controller.shutdown();
+        }
--- End diff --

nit: can you make the `catch` start on the same line as the close brace of the `try`?
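For reference, the brace style requested in the nit looks like the sketch below. The quoted hunk is cut off before the `catch`, so the handler body here (restoring the interrupt flag) is an assumption, and `SketchManager` with its nested `Controller` interface is a hypothetical stand-in for the real classes.

```java
/** Hypothetical stand-in for a manager that owns an optional controller. */
final class SketchManager {
  /** Hypothetical controller interface with the shutdown() signature from the diff. */
  interface Controller {
    void shutdown() throws InterruptedException;
  }

  private final Controller controller;

  SketchManager(Controller controller) {
    this.controller = controller;
  }

  void disconnect() {
    if (controller != null) {
      try {
        controller.shutdown();
      } catch (InterruptedException e) { // catch on the same line as the closing brace
        // Assumed handling: restore the interrupt flag so callers can observe it.
        Thread.currentThread().interrupt();
      }
    }
  }
}
```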
[GitHub] helix pull request: [HELIX-550] ZKHelixManager should shutdown Gen...
Github user kanakb commented on the pull request:

https://github.com/apache/helix/pull/11#issuecomment-63591519

Merged, thanks!
helix - Build # 1304 - Still Unstable
The Apache Jenkins build system has built helix (build #1304)

Status: Still Unstable

Check console output at https://builds.apache.org/job/helix/1304/ to view the results.
Review Request 28215: [HELIX-550] Shutdown GenericHelixController on disconnect (port to master)
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28215/
---

Review request for helix, Zhen Zhang and Kishore Gopalakrishna.

Bugs: HELIX-550

Repository: helix-git

Description
---

This is a port to master of this PR: https://github.com/apache/helix/pull/11/files#diff-866b65f4aa4b4753224ff615eb2efc1eR533

commit bfb4a3d34228f5c3806b1eee9e98f401386e66a9
Author: Kanak Biscuitwala kana...@hotmail.com
Date: Tue Nov 18 21:23:29 2014 -0800

    [HELIX-550] Shutdown GenericHelixController on disconnect

:100644 100644 aef636e... 113cace... M helix-core/src/main/java/org/apache/helix/controller/GenericHelixController.java
:100644 100644 295b69c... fafe604... M helix-core/src/main/java/org/apache/helix/manager/zk/ZkHelixController.java

Diffs
---

  helix-core/src/main/java/org/apache/helix/controller/GenericHelixController.java 6fa3d05
  helix-core/src/main/java/org/apache/helix/manager/zk/ZkHelixController.java 295b69c

Diff: https://reviews.apache.org/r/28215/diff/

Testing
---

Thanks,

Kanak Biscuitwala