[
https://issues.apache.org/jira/browse/APEXCORE-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Yan updated APEXCORE-313:
-------------------------------
Description:
When an operator dies, its output data in the buffer server should be
invalidated. Currently it is not, and unless we do this:
{code}
localCluster.setPerContainerBufferServer(true);
{code}
a newly recovered operator can receive ghost data from an upstream operator
in the same checkpoint group that is still recovering. When the upstream
operator finally recovers, it resends the data from the recovery checkpoint,
which duplicates the ghost data and leaves the application in a bad state.
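The duplication can be illustrated with a toy model. This is only a sketch, not Apex code: ToyBufferServer, replay and the window numbers are hypothetical names and values chosen for illustration, and the real buffer server is considerably more involved. It assumes an upstream operator that published windows 1 to 3, a recovery checkpoint at window 1, and a downstream operator that reads everything after the checkpoint.

```java
import java.util.ArrayList;
import java.util.List;

public class GhostDataSketch {

    /** Hypothetical stand-in for a buffer server shared by operators; not the Apex class. */
    static class ToyBufferServer {
        private final List<Long> windows = new ArrayList<>();

        void publish(long windowId) {
            windows.add(windowId);
        }

        /** Drops everything the dead publisher left behind. */
        void purge() {
            windows.clear();
        }

        /** Returns every buffered window after the given checkpoint window. */
        List<Long> readAfter(long checkpointWindow) {
            List<Long> result = new ArrayList<>();
            for (long w : windows) {
                if (w > checkpointWindow) {
                    result.add(w);
                }
            }
            return result;
        }
    }

    /** Returns what the recovered downstream operator sees, with or without a purge. */
    static List<Long> replay(boolean purgeOnDeath) {
        ToyBufferServer server = new ToyBufferServer();
        long checkpointWindow = 1; // assumed recovery checkpoint

        // Before the failure, the upstream operator published windows 1..3.
        for (long w = 1; w <= 3; w++) {
            server.publish(w);
        }

        // The upstream operator dies. Windows 2 and 3 are now ghost data.
        if (purgeOnDeath) {
            server.purge();
        }

        // The upstream operator recovers from the checkpoint and resends windows 2 and 3.
        server.publish(2);
        server.publish(3);

        return server.readAfter(checkpointWindow);
    }

    public static void main(String[] args) {
        System.out.println("without purge: " + replay(false)); // [2, 3, 2, 3]
        System.out.println("with purge:    " + replay(true));  // [2, 3]
    }
}
```

Without the purge, the downstream operator sees windows 2 and 3 twice, once as ghost data and once from the replay. With the purge, each window arrives once.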
How to reproduce:
In DelayOperatorTest.java, comment out the lines with
localCluster.setPerContainerBufferServer(true) and run testFibonacciRecovery1.
At recovery, the FIB operator becomes blocked because of this problem. STRAM
detects the blocked operator after 30 seconds and redeploys the operators,
after which things go back to normal. The unit test eventually passes, but
the recovery takes more than 30 seconds.
> BufferServer not purged correctly in StramLocalCluster
> -------------------------------------------------------
>
> Key: APEXCORE-313
> URL: https://issues.apache.org/jira/browse/APEXCORE-313
> Project: Apache Apex Core
> Issue Type: Bug
> Reporter: David Yan