[jira] [Updated] (SAMZA-1663) StreamProcessor fails after reestablishing zookeeper session.
[ https://issues.apache.org/jira/browse/SAMZA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shanthoosh Venkataraman updated SAMZA-1663: --- Description: *Problem:* StreamProcessor fails sporadically after reestablishing the zookeeper session. *Analysis:* Currently in ZkJobCoordinator, buffered stale events in the leader stream processor before a zookeeper session disconnect causes the StreamProcessor to fail after session reconnect. Consider the following scenario: A. Leader receives processor change notification from zookeeper. B. Before it acts on the notification, zookeeper session expiration occurs. C. Leader stream processor disconnects from the group and joins the group as a follower. D. It acts upon the stale events which were buffered in its worker queue((generated for some non-existent processors)), which ends up killing it. *Relevant stacktrace:* As we can see in this case, before the session expiration there are two processorIds viz [1, 2]. After the session expiration, alive processorIds were [3, 4]. StreamProcessor failed after zookeeper session reconnect since it tried to generate JobModel for non-existent processors([1, 2]). {code:java} [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkJobCoordinator - Got new session created event for processor=01 [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkJobCoordinator - register zk controller for the new session [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkUtils - Created ephemeral path: /app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData/processors/04 for processor: svenkata-mn1.linkedin.biz 01 in zookeeper. [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkUtils - Found these children - [03, 04] [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - tryBecomeLeader: index = 1 for path=/app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452 c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData/processors/04 out of [03, 04] [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - Index = 1 Not eligible to be a leader yet! [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - [Processor-01] Subscribing data change for 03 [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - [Processor-01] Predecessor still exists. Current subscription is valid. Continuing as non-leader. [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkUtils - Current version for zk root node: /app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452c-bf7d-e96 25c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData is 1.0, expected version is 1.0 [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkJobCoordinator - Got syncconnected event for processor=01. [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ScheduleAfterDebounceTime - Attempting to cancel the future of action: ZK_SESSION_ERROR 15:16:50.417 [DEBUG] [TestEventLogger] 540098 [debounce-thread-0] INFO org.apache.samza.zk.ZkUtils - read the model ver=2 from /app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa- d00d-452c-bf7d-e9625c8f8193-coordinationData/JobModelGeneration/jobModels/2 15:16:50.415 [DEBUG] [TestEventLogger] 540096 [debounce-thread-0] INFO org.apache.samza.zk.ZkUtils - Found these children - [01, 02] 15:16:50.416 [DEBUG] [TestEventLogger] 540096 [debounce-thread-0] INFO org.apache.samza.zk.ZkUtils - Found these processorIds [debounce-thread-0] INFO org.apache.samza.zk.ZkJobCoordinator - Generating new JobModel with processors: []. [debounce-thread-0] INFO org.apache.samza.zk.ZkUtils - read the model ver=2 from /app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData/JobModelGeneration/jobModels/2 [debounce-thread-0] INFO org.apache.samza.coordinator.JobModelManager$ - SystemStreamPartitionGrouper org.apache.samza.container.grouper.stream.GroupByPartition@33169a55 has grouped the
[jira] [Updated] (SAMZA-1663) StreamProcessor fails after reestablishing zookeeper session.
[ https://issues.apache.org/jira/browse/SAMZA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shanthoosh Venkataraman updated SAMZA-1663: --- Description: *Problem:* StreamProcessor fails sporadically after reestablishing the zookeeper session. *Analysis:* Currently in ZkJobCoordinator, buffered stale events in the leader stream processor before a zookeeper session disconnect causes the StreamProcessor to fail after session reconnect. Consider the following scenario: A. Leader receives processor change notification from zookeeper. B. Before it acts on the notification, zookeeper session expiration occurs. C. Leader stream processor disconnects from the group and joins the group as a follower. D. It acts upon the stale events which were buffered in its worker queue((generated for some non-existent processors)), which ends up killing it. *Relevant stacktrace:* As we can see in this case, before the session expiration there are two processorIds viz [1, 2]. After the session expiration, alive processorIds were [3, 4]. StreamProcessor failed after zookeeper session reconnect since it tried to generate JobModel for non-existent processors([1, 2]). {code:java} [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkJobCoordinator - Got new session created event for processor=01 [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkJobCoordinator - register zk controller for the new session [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkUtils - Created ephemeral path: /app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData/processors/04 for processor: svenkata-mn1.linkedin.biz 01 in zookeeper. [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkUtils - Found these children - [03, 04] [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - tryBecomeLeader: index = 1 for path=/app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452 c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData/processors/04 out of [03, 04] [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - Index = 1 Not eligible to be a leader yet! [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - [Processor-01] Subscribing data change for 03 [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - [Processor-01] Predecessor still exists. Current subscription is valid. Continuing as non-leader. [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkUtils - Current version for zk root node: /app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452c-bf7d-e96 25c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData is 1.0, expected version is 1.0 [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkJobCoordinator - Got syncconnected event for processor=01. [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ScheduleAfterDebounceTime - Attempting to cancel the future of action: ZK_SESSION_ERROR 15:16:50.417 [DEBUG] [TestEventLogger] 540098 [debounce-thread-0] INFO org.apache.samza.zk.ZkUtils - read the model ver=2 from /app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa- d00d-452c-bf7d-e9625c8f8193-coordinationData/JobModelGeneration/jobModels/2 15:16:50.415 [DEBUG] [TestEventLogger] 540096 [debounce-thread-0] INFO org.apache.samza.zk.ZkUtils - Found these children - [01, 02] 15:16:50.416 [DEBUG] [TestEventLogger] 540096 [debounce-thread-0] INFO org.apache.samza.zk.ZkUtils - Found these processorIds [debounce-thread-0] INFO org.apache.samza.zk.ZkJobCoordinator - Generating new JobModel with processors: []. [debounce-thread-0] INFO org.apache.samza.zk.ZkUtils - read the model ver=2 from /app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData/JobModelGeneration/jobModels/2 [debounce-thread-0] INFO org.apache.samza.coordinator.JobModelManager$ - SystemStreamPartitionGrouper org.apache.samza.container.grouper.stream.GroupByPartition@33169a55 has grouped the
[jira] [Updated] (SAMZA-1663) StreamProcessor fails after reestablishing zookeeper session.
[ https://issues.apache.org/jira/browse/SAMZA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shanthoosh Venkataraman updated SAMZA-1663: --- Description: *Problem:* StreamProcessor fails sporadically after reestablishing the zookeeper session. *Analysis:* Currently in ZkJobCoordinator, buffered stale events in the leader stream processor before a zookeeper session disconnect causes the StreamProcessor to fail after session reconnect. Consider the following scenario: A. Leader receives processor change notification from zookeeper. B. Before it acts on the notification, zookeeper session expiration occurs. C. Leader stream processor disconnects from the group and joins the group as a follower. D. It acts upon the stale events which were buffered in its worker queue((generated for some non-existent processors)), which ends up killing it. *Relevant stacktrace:* As we can see in this case, before the session expiration there are two processorIds viz [1, 2]. After the session expiration, alive processorIds were [3, 4]. StreamProcessor failed after zookeeper session reconnect since it tried to generate JobModel for non-existent processors([1, 2]). {code:java} [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkJobCoordinator - Got new session created event for processor=01 [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkJobCoordinator - register zk controller for the new session [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkUtils - Created ephemeral path: /app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData/processors/04 for processor: svenkata-mn1.linkedin.biz 01 in zookeeper. [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkUtils - Found these children - [03, 04] [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - tryBecomeLeader: index = 1 for path=/app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452 c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData/processors/04 out of [03, 04] [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - Index = 1 Not eligible to be a leader yet! [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - [Processor-01] Subscribing data change for 03 [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - [Processor-01] Predecessor still exists. Current subscription is valid. Continuing as non-leader. [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkUtils - Current version for zk root node: /app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452c-bf7d-e96 25c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData is 1.0, expected version is 1.0 15:16:50.417 [DEBUG] [TestEventLogger] 540098 [debounce-thread-0] INFO org.apache.samza.zk.ZkUtils - read the model ver=2 from /app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa- d00d-452c-bf7d-e9625c8f8193-coordinationData/JobModelGeneration/jobModels/2```[ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkJobCoordinator - Got syncconnected event for processor=01. [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ScheduleAfterDebounceTime - Attempting to cancel the future of action: ZK_SESSION_ERROR 15:16:50.415 [DEBUG] [TestEventLogger] 540096 [debounce-thread-0] INFO org.apache.samza.zk.ZkUtils - Found these children - [01, 02] 15:16:50.416 [DEBUG] [TestEventLogger] 540096 [debounce-thread-0] INFO org.apache.samza.zk.ZkUtils - Found these processorIds [debounce-thread-0] INFO org.apache.samza.zk.ZkJobCoordinator - Generating new JobModel with processors: []. [debounce-thread-0] INFO org.apache.samza.zk.ZkUtils - read the model ver=2 from /app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData/JobModelGeneration/jobModels/2 [debounce-thread-0] INFO org.apache.samza.coordinator.JobModelManager$ - SystemStreamPartitionGrouper org.apache.samza.container.grouper.stream.GroupByPartition@33169a55 has grouped the
[jira] [Updated] (SAMZA-1663) StreamProcessor fails after reestablishing zookeeper session.
[ https://issues.apache.org/jira/browse/SAMZA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shanthoosh Venkataraman updated SAMZA-1663: --- Description: *Problem:* StreamProcessor fails sporadically after reestablishing the zookeeper session. *Analysis:* Currently in ZkJobCoordinator, buffered stale events in the leader stream processor before a zookeeper session disconnect causes the StreamProcessor to fail after session reconnect. Consider the following scenario: A. Leader receives processor change notification from zookeeper. B. Before it acts on the notification, zookeeper session expiration occurs. C. Leader stream processor disconnects from the group and joins the group as a follower. D. It acts upon the stale events which were buffered in its worker queue((generated for some non-existent processors)), which ends up killing it. *Relevant stacktrace:* As we can see in this case, before the session expiration there are two processorIds viz [1, 2]. After the session expiration, alive processorIds were [3, 4]. StreamProcessor failed after zookeeper session reconnect since it tried to generate JobModel for non-existent processors([1, 2]). {code:java} ``` [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkJobCoordinator - Got new session created event for processor=01 [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkJobCoordinator - register zk controller for the new session [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkUtils - Created ephemeral path: /app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData/processors/04 for processor: svenkata-mn1.linkedin.biz 01 in zookeeper. [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkUtils - Found these children - [03, 04] [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - tryBecomeLeader: index = 1 for path=/app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452 c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData/processors/04 out of [03, 04] [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - Index = 1 Not eligible to be a leader yet! [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - [Processor-01] Subscribing data change for 03 [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - [Processor-01] Predecessor still exists. Current subscription is valid. Continuing as non-leader. [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkUtils - Current version for zk root node: /app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452c-bf7d-e96 25c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData is 1.0, expected version is 1.0 15:16:50.417 [DEBUG] [TestEventLogger] 540098 [debounce-thread-0] INFO org.apache.samza.zk.ZkUtils - read the model ver=2 from /app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa- d00d-452c-bf7d-e9625c8f8193-coordinationData/JobModelGeneration/jobModels/2```[ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkJobCoordinator - Got syncconnected event for processor=01. [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ScheduleAfterDebounceTime - Attempting to cancel the future of action: ZK_SESSION_ERROR 15:16:50.415 [DEBUG] [TestEventLogger] 540096 [debounce-thread-0] INFO org.apache.samza.zk.ZkUtils - Found these children - [01, 02] 15:16:50.416 [DEBUG] [TestEventLogger] 540096 [debounce-thread-0] INFO org.apache.samza.zk.ZkUtils - Found these processorIds [debounce-thread-0] INFO org.apache.samza.zk.ZkJobCoordinator - Generating new JobModel with processors: []. [debounce-thread-0] INFO org.apache.samza.zk.ZkUtils - read the model ver=2 from /app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData/JobModelGeneration/jobModels/2 [debounce-thread-0] INFO org.apache.samza.coordinator.JobModelManager$ - SystemStreamPartitionGrouper org.apache.samza.container.grouper.stream.GroupByPartition@33169a55 has grouped the
[jira] [Updated] (SAMZA-1663) StreamProcessor fails after reestablishing zookeeper session.
[ https://issues.apache.org/jira/browse/SAMZA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shanthoosh Venkataraman updated SAMZA-1663: --- Description: *Problem:* StreamProcessor fails sporadically after reestablishing the zookeeper session. *Analysis:* Currently in ZkJobCoordinator, buffered stale events in the leader stream processor before a zookeeper session disconnect causes the StreamProcessor to fail after session reconnect. Consider the following scenario: A. Leader receives processor change notification from zookeeper. B. Before it acts on the notification, zookeeper session expiration occurs. C. Leader stream processor disconnects from the group and joins the group as a follower. D. It acts upon the stale events which were buffered in its worker queue((generated for some non-existent processors)), which ends up killing it. *Relevant stacktrace:* As we can see in this case, before the session expiration there are two processorIds viz [1, 2]. After the session expiration, alive processorIds were [3, 4]. StreamProcessor failed after zookeeper session reconnect since it tried to generate JobModel for non-existent processors([1, 2]). {code:java} ``` [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkJobCoordinator - Got new session created event for processor=01 [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkJobCoordinator - register zk controller for the new session [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkUtils - Created ephemeral path: /app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData/processors/04 for processor: svenkata-mn1.linkedin.biz 01 in zookeeper. [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkUtils - Found these children - [03, 04] ```651519 15:16:29.792 [DEBUG] [TestEventLogger] 519474 [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkUtils - Found these children - [03, 04] 651520 15:16:29.792 [DEBUG] [TestEventLogger] 519474 [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - tryBecomeLeader: index = 1 for path=/app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452 c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData/processors/04 out of [03, 04] 651521 15:16:29.792 [DEBUG] [TestEventLogger] 519474 [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - Index = 1 Not eligible to be a leader yet! 651522 15:16:29.792 [DEBUG] [TestEventLogger] 519474 [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - [Processor-01] Subscribing data change for 03 651523 15:16:29.793 [DEBUG] [TestEventLogger] 519475 [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - [Processor-01] Predecessor still exists. Current subscription is valid. Continuing as non-leader. 651524 15:16:29.794 [DEBUG] [TestEventLogger] 519476 [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkUtils - Current version for zk root node: /app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452c-bf7d-e96 25c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData is 1.0, expected version is 1.0 15:16:50.417 [DEBUG] [TestEventLogger] 540098 [debounce-thread-0] INFO org.apache.samza.zk.ZkUtils - read the model ver=2 from /app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa- d00d-452c-bf7d-e9625c8f8193-coordinationData/JobModelGeneration/jobModels/2```[ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkJobCoordinator - Got syncconnected event for processor=01. [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ScheduleAfterDebounceTime - Attempting to cancel the future of action: ZK_SESSION_ERROR 15:16:50.415 [DEBUG] [TestEventLogger] 540096 [debounce-thread-0] INFO org.apache.samza.zk.ZkUtils - Found these children - [01, 02] 15:16:50.416 [DEBUG] [TestEventLogger] 540096 [debounce-thread-0] INFO org.apache.samza.zk.ZkUtils - Found these processorIds [debounce-thread-0] INFO org.apache.samza.zk.ZkJobCoordinator - Generating new JobModel with processors: []. [debounce-thread-0] INFO org.apache.samza.zk.ZkUtils - read the model
[jira] [Updated] (SAMZA-1663) StreamProcessor fails after reestablishing zookeeper session.
[ https://issues.apache.org/jira/browse/SAMZA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shanthoosh Venkataraman updated SAMZA-1663: --- Description: *Problem:* StreamProcessor fails sporadically after reestablishing the zookeeper session. *Analysis:* Currently in ZkJobCoordinator, buffered stale events in the leader stream processor before a zookeeper session disconnect causes the StreamProcessor to fail after session reconnect. Consider the following scenario: A. Leader receives processor change notification from zookeeper. B. Before it acts on the notification, zookeeper session expiration occurs. C. Leader stream processor disconnects from the group and joins the group as a follower. D. It acts upon the stale events which were buffered in its worker queue((generated for some non-existent processors)), which ends up killing it. *Relevant stacktrace:* As we can see in this case, before the session expiration there are two processorIds viz [1, 2]. After the session expiration, alive processorIds were [3, 4]. StreamProcessor failed after zookeeper session reconnect since it tried to generate JobModel for non-existent processors([1, 2]). {code:java} ``` [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkJobCoordinator - Got new session created event for processor=01 [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkJobCoordinator - register zk controller for the new session [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkUtils - Created ephemeral path: /app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData/processors/04 for processor: svenkata-mn1.linkedin.biz 01 in zookeeper. [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkUtils - Found these children - [03, 04] [ZkClient-EventThread-329-127.0.0.1:53447] INFO org.apache.samza.zk.ZkControllerImpl - ZkControllerImpl::ProcessorChangeHandler::handleChildChange - Path: /app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f819 3/test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData/processors Current Children: [03, 04] [ZkClient-EventThread-329-127.0.0.1:53447] INFO org.apache.samza.zk.ZkJobCoordinator - ZkJobCoordinator::onProcessorChange - list of processors changed! List size=2 [ZkClient-EventThread-329-127.0.0.1:53447] INFO org.apache.samza.zk.ScheduleAfterDebounceTime - Attempting to cancel the future of action: OnProcessorChange [ZkClient-EventThread-329-127.0.0.1:53447] INFO org.apache.samza.zk.ScheduleAfterDebounceTime - Scheduled action: OnProcessorChange to run after: 3 milliseconds. [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkUtils - Found these children - [03, 04] [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - tryBecomeLeader: index = 1 for path=/app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452 c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData/processors/04 out of [03, 04] [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - Index = 1 Not eligible to be a leader yet! [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - [Processor-01] Subscribing data change for 03 [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - [Processor-01] Predecessor still exists. Current subscription is valid. Continuing as non-leader. [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkUtils - Current version for zk root node: /app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452c-bf7d-e96 25c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData is 1.0, expected version is 1.0 [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkUtils - subscribing for jm version change at:/app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452c-bf7d -e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData/JobModelGeneration/jobModelVersion [ZkClient-EventThread-291-127.0.0.1:53447] WARN org.apache.samza.zk.ZkUtils - SKIPPING handleDataChanged for ProcessorChangeHandler from wrong
[jira] [Updated] (SAMZA-1663) StreamProcessor fails after reestablishing zookeeper session.
[ https://issues.apache.org/jira/browse/SAMZA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shanthoosh Venkataraman updated SAMZA-1663: --- Description: *Problem:* StreamProcessor fails sporadically after reestablishing the zookeeper session. *Analysis:* Currently in ZkJobCoordinator, buffered stale events in the leader stream processor before a zookeeper session disconnect causes the StreamProcessor to fail after session reconnect. Consider the following scenario: A. Leader receives processor change notification from zookeeper. B. Before it acts on the notification, zookeeper session expiration occurs. C. Leader stream processor disconnects from the group and joins the group as a follower. D. It acts upon the stale events which were buffered in its worker queue((generated for some non-existent processors)), which ends up killing it. *Relevant stacktrace:* As we can see in this case, before the session expiration there are two processorIds viz [1, 2]. After the session expiration, alive processorIds were [3, 4]. StreamProcessor failed since the leader tries to generate JobModel based upon stale processorIds [1, 2] after zookeeper session reconnect. {code:java} ``` [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkJobCoordinator - Got new session created event for processor=01 [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkJobCoordinator - register zk controller for the new session [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkUtils - Created ephemeral path: /app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData/processors/04 for processor: svenkata-mn1.linkedin.biz 01 in zookeeper. [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkUtils - Found these children - [03, 04] [ZkClient-EventThread-329-127.0.0.1:53447] WARN org.apache.samza.zk.ZkUtils - ALLOWING handleDataChanged for ProcessorChangeHandler from wrong generation. current generation=1; callback generation= 1 [ZkClient-EventThread-329-127.0.0.1:53447] INFO org.apache.samza.zk.ZkControllerImpl - ZkControllerImpl::ProcessorChangeHandler::handleChildChange - Path: /app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f819 3/test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData/processors Current Children: [03, 04] [ZkClient-EventThread-329-127.0.0.1:53447] INFO org.apache.samza.zk.ZkJobCoordinator - ZkJobCoordinator::onProcessorChange - list of processors changed! List size=2 [ZkClient-EventThread-329-127.0.0.1:53447] INFO org.apache.samza.zk.ScheduleAfterDebounceTime - Attempting to cancel the future of action: OnProcessorChange [ZkClient-EventThread-329-127.0.0.1:53447] INFO org.apache.samza.zk.ScheduleAfterDebounceTime - Scheduled action: OnProcessorChange to run after: 3 milliseconds. [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkUtils - Found these children - [03, 04] [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - tryBecomeLeader: index = 1 for path=/app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452 c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData/processors/04 out of [03, 04] [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - Index = 1 Not eligible to be a leader yet! [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - [Processor-01] Subscribing data change for 03 [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkLeaderElector - [Processor-01] Predecessor still exists. Current subscription is valid. Continuing as non-leader. [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkUtils - Current version for zk root node: /app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452c-bf7d-e96 25c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-coordinationData is 1.0, expected version is 1.0 [ZkClient-EventThread-291-127.0.0.1:53447] INFO org.apache.samza.zk.ZkUtils - subscribing for jm version change at:/app-test-app-name-e4bb9daa-d00d-452c-bf7d-e9625c8f8193-test-app-id-e4bb9daa-d00d-452c-bf7d-e9625c8f8193/test-app-name-e4bb9daa-d00d-452c-bf7d