[jira] [Commented] (APEXMALHAR-2419) KafkaSinglePortExactlyOnceOutputOperator fails on recovery
[ https://issues.apache.org/jira/browse/APEXMALHAR-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940588#comment-15940588 ] Thomas Weise commented on APEXMALHAR-2419: -- Missing fix version. > KafkaSinglePortExactlyOnceOutputOperator fails on recovery > -- > > Key: APEXMALHAR-2419 > URL: https://issues.apache.org/jira/browse/APEXMALHAR-2419 > Project: Apache Apex Malhar > Issue Type: Bug >Reporter: Munagala V. Ramanath >Assignee: Sandesh > > The KafkaSinglePortExactlyOnceOutputOperator fails on recovery with the > message: "Violates Exactly once. Not all the tuples received after operator > reset." > This is because of this check in endWindow(): > {code} >if (!partialWindowTuples.isEmpty() && windowId > > windowDataManager.getLargestCompletedWindow()) { > throw new RuntimeException("Violates Exactly once. Not all the tuples > received after operator reset."); > } > {code} > Note: > It is failing because of the bug in the WindowDataManager, > https://issues.apache.org/jira/browse/APEXMALHAR-2422 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (APEXMALHAR-2419) KafkaSinglePortExactlyOnceOutputOperator fails on recovery
[ https://issues.apache.org/jira/browse/APEXMALHAR-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879182#comment-15879182 ] Thomas Weise commented on APEXMALHAR-2419: -- Completed windows (endWindow) are supposed to be flushed to WAL, irrespective of committed. > KafkaSinglePortExactlyOnceOutputOperator fails on recovery > -- > > Key: APEXMALHAR-2419 > URL: https://issues.apache.org/jira/browse/APEXMALHAR-2419 > Project: Apache Apex Malhar > Issue Type: Bug >Reporter: Munagala V. Ramanath >Assignee: Sandesh > > The KafkaSinglePortExactlyOnceOutputOperator fails on recovery with the > message: "Violates Exactly once. Not all the tuples received after operator > reset." > This is because of this check in endWindow(): > {code} >if (!partialWindowTuples.isEmpty() && windowId > > windowDataManager.getLargestCompletedWindow()) { > throw new RuntimeException("Violates Exactly once. Not all the tuples > received after operator reset."); > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (APEXMALHAR-2419) KafkaSinglePortExactlyOnceOutputOperator fails on recovery
[ https://issues.apache.org/jira/browse/APEXMALHAR-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878951#comment-15878951 ] Sandesh commented on APEXMALHAR-2419: - Largest completed window is moving only after a committed call on WindowDataManager but the expected behavior is, Largest completed Window is the last saved Window ID. Current behaviour breaks not only this operator but also all the operators that use this operator. > KafkaSinglePortExactlyOnceOutputOperator fails on recovery > -- > > Key: APEXMALHAR-2419 > URL: https://issues.apache.org/jira/browse/APEXMALHAR-2419 > Project: Apache Apex Malhar > Issue Type: Bug >Reporter: Munagala V. Ramanath >Assignee: Sandesh > > The KafkaSinglePortExactlyOnceOutputOperator fails on recovery with the > message: "Violates Exactly once. Not all the tuples received after operator > reset." > This is because of this check in endWindow(): > {code} >if (!partialWindowTuples.isEmpty() && windowId > > windowDataManager.getLargestCompletedWindow()) { > throw new RuntimeException("Violates Exactly once. Not all the tuples > received after operator reset."); > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (APEXMALHAR-2419) KafkaSinglePortExactlyOnceOutputOperator fails on recovery
[ https://issues.apache.org/jira/browse/APEXMALHAR-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878506#comment-15878506 ] Munagala V. Ramanath commented on APEXMALHAR-2419: -- To clarify, the point of the logs above is manifold: 1. The semantics of getLargestCompletedWindow() call are unclear -- what does "completed" mean ? We have well defined meanings for "committed" and "checkpointed" windows but not for "completed"; the only reasonable meaning I can think of is: a window in which endWindow() has been called. It appears that operators are using this call with this expectation. 2. If the above interpretation is correct, we expect this call to return a new value after each endWindow() call but that is not happening -- the value is constant for the entire duration of a single run. 2. The operators that use WindowDataManager need to know when they are processing replayed windows and when replay is finished. Given the current behavior, there is currently no way for operators to determine when replay is done using WindowDataManager. > KafkaSinglePortExactlyOnceOutputOperator fails on recovery > -- > > Key: APEXMALHAR-2419 > URL: https://issues.apache.org/jira/browse/APEXMALHAR-2419 > Project: Apache Apex Malhar > Issue Type: Bug >Reporter: Munagala V. Ramanath >Assignee: Sandesh > > The KafkaSinglePortExactlyOnceOutputOperator fails on recovery with the > message: "Violates Exactly once. Not all the tuples received after operator > reset." > This is because of this check in endWindow(): > {code} >if (!partialWindowTuples.isEmpty() && windowId > > windowDataManager.getLargestCompletedWindow()) { > throw new RuntimeException("Violates Exactly once. Not all the tuples > received after operator reset."); > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (APEXMALHAR-2419) KafkaSinglePortExactlyOnceOutputOperator fails on recovery
[ https://issues.apache.org/jira/browse/APEXMALHAR-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878463#comment-15878463 ] Munagala V. Ramanath commented on APEXMALHAR-2419: -- [~sandesh] Does "recovery is not happening" mean (a) This is the first run of the operator and it has not gone down; or (b) This may be a restarted run but we've finished replaying all the old windows are are now processing new windows ? largestCompletedWindow will be -1 for (a) but not for (b) > KafkaSinglePortExactlyOnceOutputOperator fails on recovery > -- > > Key: APEXMALHAR-2419 > URL: https://issues.apache.org/jira/browse/APEXMALHAR-2419 > Project: Apache Apex Malhar > Issue Type: Bug >Reporter: Munagala V. Ramanath >Assignee: Sandesh > > The KafkaSinglePortExactlyOnceOutputOperator fails on recovery with the > message: "Violates Exactly once. Not all the tuples received after operator > reset." > This is because of this check in endWindow(): > {code} >if (!partialWindowTuples.isEmpty() && windowId > > windowDataManager.getLargestCompletedWindow()) { > throw new RuntimeException("Violates Exactly once. Not all the tuples > received after operator reset."); > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (APEXMALHAR-2419) KafkaSinglePortExactlyOnceOutputOperator fails on recovery
[ https://issues.apache.org/jira/browse/APEXMALHAR-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876656#comment-15876656 ] Sandesh commented on APEXMALHAR-2419: - Fix is here, until it is merged. https://github.com/sandeshh/apex-malhar/tree/APEXMALHAR-2419 > KafkaSinglePortExactlyOnceOutputOperator fails on recovery > -- > > Key: APEXMALHAR-2419 > URL: https://issues.apache.org/jira/browse/APEXMALHAR-2419 > Project: Apache Apex Malhar > Issue Type: Bug >Reporter: Munagala V. Ramanath >Assignee: Sandesh > > The KafkaSinglePortExactlyOnceOutputOperator fails on recovery with the > message: "Violates Exactly once. Not all the tuples received after operator > reset." > This is because of this check in endWindow(): > {code} >if (!partialWindowTuples.isEmpty() && windowId > > windowDataManager.getLargestCompletedWindow()) { > throw new RuntimeException("Violates Exactly once. Not all the tuples > received after operator reset."); > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (APEXMALHAR-2419) KafkaSinglePortExactlyOnceOutputOperator fails on recovery
[ https://issues.apache.org/jira/browse/APEXMALHAR-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876641#comment-15876641 ] Sandesh commented on APEXMALHAR-2419: - largestCompletedWindow will be -1, if the recovery is not happening. > KafkaSinglePortExactlyOnceOutputOperator fails on recovery > -- > > Key: APEXMALHAR-2419 > URL: https://issues.apache.org/jira/browse/APEXMALHAR-2419 > Project: Apache Apex Malhar > Issue Type: Bug >Reporter: Munagala V. Ramanath > > The KafkaSinglePortExactlyOnceOutputOperator fails on recovery with the > message: "Violates Exactly once. Not all the tuples received after operator > reset." > This is because of this check in endWindow(): > {code} >if (!partialWindowTuples.isEmpty() && windowId > > windowDataManager.getLargestCompletedWindow()) { > throw new RuntimeException("Violates Exactly once. Not all the tuples > received after operator reset."); > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (APEXMALHAR-2419) KafkaSinglePortExactlyOnceOutputOperator fails on recovery
[ https://issues.apache.org/jira/browse/APEXMALHAR-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876082#comment-15876082 ] Munagala V. Ramanath commented on APEXMALHAR-2419: -- The expectation seems to be that `windowDataManager.getLargestCompletedWindow()` will be the last window on which `windowDataManager.save()` was called in the previous failed run which in turn means that `windowId > windowDataManager.getLargestCompletedWindow()` will be true only when replay is complete. However as the following log messages show, the `windowDataManager.getLargestCompletedWindow()` always returns the same value during a run -- this may indicate a bug in FSWindowDataManager. Notice in the initial deploy, the largestCompletedWindow is always -1 even after a save() call in endWindow: Initial deploy: {quote} 2017-02-21 06:18:55,196 INFO org.apache.apex.malhar.kafka.KafkaSinglePortExactlyOnceOutputOperator: beginWindow: windowId = 6389565783323705345, largestCompletedWindow = -1 2017-02-21 06:18:55,240 INFO org.apache.apex.malhar.kafka.KafkaSinglePortExactlyOnceOutputOperator: endWindow: windowId = 6389565783323705345, partialWindowTuples.size = 0, largestCompletedWindow = -1 2017-02-21 06:18:55,747 INFO org.apache.apex.malhar.kafka.KafkaSinglePortExactlyOnceOutputOperator: endWindow: saved data for windowId = 6389565783323705345, largestCompletedWindow = -1 2017-02-21 06:18:55,749 INFO org.apache.apex.malhar.kafka.KafkaSinglePortExactlyOnceOutputOperator: beginWindow: windowId = 6389565783323705346, largestCompletedWindow = -1 2017-02-21 06:18:55,749 INFO org.apache.apex.malhar.kafka.KafkaSinglePortExactlyOnceOutputOperator: endWindow: windowId = 6389565783323705346, partialWindowTuples.size = 0, largestCompletedWindow = -1 2017-02-21 06:18:55,806 INFO org.apache.apex.malhar.kafka.KafkaSinglePortExactlyOnceOutputOperator: endWindow: saved data for windowId = 6389565783323705346, largestCompletedWindow = -1 2017-02-21 06:18:55,806 INFO org.apache.apex.malhar.kafka.KafkaSinglePortExactlyOnceOutputOperator: beginWindow: windowId = 6389565783323705347, largestCompletedWindow = -1 2017-02-21 06:18:55,806 INFO org.apache.apex.malhar.kafka.KafkaSinglePortExactlyOnceOutputOperator: endWindow: windowId = 6389565783323705347, partialWindowTuples.size = 0, largestCompletedWindow = -1 2017-02-21 06:18:55,906 INFO org.apache.apex.malhar.kafka.KafkaSinglePortExactlyOnceOutputOperator: endWindow: saved data for windowId = 6389565783323705347, largestCompletedWindow = -1 2017-02-21 06:18:55,906 INFO org.apache.apex.malhar.kafka.KafkaSinglePortExactlyOnceOutputOperator: beginWindow: windowId = 6389565783323705348, largestCompletedWindow = -1 2017-02-21 06:18:55,906 INFO org.apache.apex.malhar.kafka.KafkaSinglePortExactlyOnceOutputOperator: endWindow: windowId = 6389565783323705348, partialWindowTuples.size = 0, largestCompletedWindow = -1 2017-02-21 06:18:55,964 INFO org.apache.apex.malhar.kafka.KafkaSinglePortExactlyOnceOutputOperator: endWindow: saved data for windowId = 6389565783323705348, largestCompletedWindow = -1 2017-02-21 06:18:55,964 INFO org.apache.apex.malhar.kafka.KafkaSinglePortExactlyOnceOutputOperator: beginWindow: windowId = 6389565783323705349, largestCompletedWindow = -1 2017-02-21 06:18:55,965 INFO org.apache.apex.malhar.kafka.KafkaSinglePortExactlyOnceOutputOperator: endWindow: windowId = 6389565783323705349, partialWindowTuples.size = 0, largestCompletedWindow = -1 2017-02-21 06:18:56,022 INFO org.apache.apex.malhar.kafka.KafkaSinglePortExactlyOnceOutputOperator: endWindow: saved data for windowId = 6389565783323705349, largestCompletedWindow = -1 2017-02-21 06:18:56,023 INFO org.apache.apex.malhar.kafka.KafkaSinglePortExactlyOnceOutputOperator: beginWindow: windowId = 6389565783323705350, largestCompletedWindow = -1 2017-02-21 06:19:03,570 INFO com.datatorrent.stram.engine.StreamingContainer: Undeploy request: [3] {quote} Notice here that when we reach the second replay window, it is not recognized as replay and throws the exception. After redeploy: {quote} 2017-02-21 06:19:11,899 INFO org.apache.apex.malhar.kafka.KafkaSinglePortExactlyOnceOutputOperator: beginWindow: windowId = 6389565783323705345, largestCompletedWindow = 6389565783323705345 2017-02-21 06:19:11,900 INFO org.apache.apex.malhar.kafka.KafkaSinglePortExactlyOnceOutputOperator: Rebuild the partial window after 6389565783323705345 2017-02-21 06:19:13,151 INFO org.apache.apex.malhar.kafka.KafkaSinglePortExactlyOnceOutputOperator: endWindow: windowId = 6389565783323705345, partialWindowTuples.size = 4, largestCompletedWindow = 6389565783323705345 2017-02-21 06:19:13,766 INFO