Sam Whittle created BEAM-11216:
----------------------------------

             Summary: StreamingDataflowWorker ReaderCache usage can be 
incorrect in presence of retries
                 Key: BEAM-11216
                 URL: https://issues.apache.org/jira/browse/BEAM-11216
             Project: Beam
          Issue Type: Bug
          Components: runner-dataflow
            Reporter: Sam Whittle


This is similar to BEAM-7547.  The issue and identified there was related to 
the state cache.  However for UnboundedSource there is a separate ReaderCache.  
In particular the following sequence could lead to using a Reader at the wrong 
position.

1. Work arrives indicating to use reader at state C1
2. Work processes and advances reader to C2, commit is prepared with elements 
output from C1 to C2 and reader checkpoint for C2
3. Commit of original processing fails (perhaps routed to previous backend 
during autoscaling)
4. Work is retried (still indicating to use reader at C1 and still same cache 
token) however the ReaderCache is used, there is a hit, and the existing reader 
(positioned at C2) is used.
5. Retry of work processes and advances reader to C3, commit is scheduled with 
elements output from C2 to C3 and reader checkpoint for C3
6. Commit succeeds

At this point there was never a successful commit for the elements between C1 
to C2, though the reader is now advanced past them.

Possible fixes:
1. Use increasing work token as in BEAM-7547 to detect retry and not use the 
ReaderCache entry
2. Seek recovered reader from cache to checkpoint or only use if it matches the 
checkpoint. This would probably involve changing the Checkpoint interface so 1 
is likely preferred.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to