Sam Whittle created BEAM-11216:
----------------------------------
Summary: StreamingDataflowWorker ReaderCache usage can be
incorrect in presence of retries
Key: BEAM-11216
URL: https://issues.apache.org/jira/browse/BEAM-11216
Project: Beam
Issue Type: Bug
Components: runner-dataflow
Reporter: Sam Whittle
This is similar to BEAM-7547. The issue and identified there was related to
the state cache. However for UnboundedSource there is a separate ReaderCache.
In particular the following sequence could lead to using a Reader at the wrong
position.
1. Work arrives indicating to use reader at state C1
2. Work processes and advances reader to C2, commit is prepared with elements
output from C1 to C2 and reader checkpoint for C2
3. Commit of original processing fails (perhaps routed to previous backend
during autoscaling)
4. Work is retried (still indicating to use reader at C1 and still same cache
token) however the ReaderCache is used, there is a hit, and the existing reader
(positioned at C2) is used.
5. Retry of work processes and advances reader to C3, commit is scheduled with
elements output from C2 to C3 and reader checkpoint for C3
6. Commit succeeds
At this point there was never a successful commit for the elements between C1
to C2, though the reader is now advanced past them.
Possible fixes:
1. Use increasing work token as in BEAM-7547 to detect retry and not use the
ReaderCache entry
2. Seek recovered reader from cache to checkpoint or only use if it matches the
checkpoint. This would probably involve changing the Checkpoint interface so 1
is likely preferred.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)