[ https://issues.apache.org/jira/browse/APEXCORE-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Yan updated APEXCORE-313:
-------------------------------
    Description: 
When an operator dies, its output data in the buffer server should be 
invalidated.  Currently it is not, and unless we do this: 
{code}
localCluster.setPerContainerBufferServer(true);
{code}
it's possible for a newly recovered operator to receive ghost data from an 
upstream operator in the same checkpoint group that is still recovering.  When 
the upstream operator finally recovers, it resends the data from the recovery 
checkpoint, which duplicates the ghost data and leaves the whole application 
in a bad state.

How to reproduce:

In DelayOperatorTest.java, comment out the lines that call 
localCluster.setPerContainerBufferServer(true) and run testFibonacciRecovery1.  
At recovery, the FIB operator becomes blocked because of this problem.  STRAM 
detects the blocked operator after 30 seconds and redeploys the operators, 
after which things go back to normal.  The unit test eventually passes, but 
recovery takes more than 30 seconds.
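
The failure mode described above can be illustrated with a small toy model. 
This is only a sketch under stated assumptions: ToyBufferServer, 
downstreamView and the tuple values are hypothetical names made up for 
illustration, and none of this is Apex code or the actual BufferServer API. 
It assumes the checkpoint was taken before the upstream operator emitted 
anything, so recovery replays every tuple from the start.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these classes are part of Apache Apex.
public class GhostDataSketch {

    /** Hypothetical stand-in for a buffer server shared by several operators. */
    static final class ToyBufferServer {
        private final Map<String, List<Integer>> retained = new HashMap<>();

        /** Stores a tuple under the publishing operator's name. */
        void publish(String publisher, int tuple) {
            retained.computeIfAbsent(publisher, k -> new ArrayList<>()).add(tuple);
        }

        /** Invalidates everything the given publisher has written so far. */
        void purge(String publisher) {
            retained.remove(publisher);
        }

        /** Hands all retained tuples of a publisher to a subscriber and forgets them. */
        List<Integer> drain(String publisher) {
            List<Integer> tuples = retained.remove(publisher);
            return tuples == null ? new ArrayList<>() : tuples;
        }
    }

    /** Returns the tuples a recovered downstream operator ends up receiving. */
    public static List<Integer> downstreamView(boolean purgeOnFailure) {
        ToyBufferServer server = new ToyBufferServer();

        // Upstream emits two tuples after the (assumed) checkpoint, then dies.
        server.publish("upstream", 1);
        server.publish("upstream", 2);
        if (purgeOnFailure) {
            server.purge("upstream"); // the invalidation this issue asks for
        }

        // Downstream recovers first and reads whatever the server still holds.
        List<Integer> seen = new ArrayList<>(server.drain("upstream"));

        // Upstream recovers later and replays everything since the checkpoint.
        for (int tuple = 1; tuple <= 3; tuple++) {
            server.publish("upstream", tuple);
        }
        seen.addAll(server.drain("upstream"));
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("no purge: " + downstreamView(false)); // no purge: [1, 2, 1, 2, 3]
        System.out.println("purge:    " + downstreamView(true));  // purge:    [1, 2, 3]
    }
}
```

Without the purge, the downstream operator in this model sees tuples 1 and 2 
twice, which corresponds to the duplicate ghost data described above.  With 
the purge, it sees each tuple once.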

  was:
When an operator dies, the output data for that operator in buffer server 
should be invalidated.  Currently it's not and unless we do this: 
{code}
localCluster.setPerContainerBufferServer(true);
{code}
it's possible for a newly recovered operator to get the ghost data from an 
upstream operator in the same checkpoint group that is still in the process of 
recovering.  When the upstream operator finally recovers, it tries to send the 
data from the recovery checkpoint that is duplicate of the ghost data, thus 
putting the whole thing in a bad state.



> BufferServer not purged correctly in StramLocalCluster 
> -------------------------------------------------------
>
>                 Key: APEXCORE-313
>                 URL: https://issues.apache.org/jira/browse/APEXCORE-313
>             Project: Apache Apex Core
>          Issue Type: Bug
>            Reporter: David Yan



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
