zentol commented on a change in pull request #7753: [FLINK-11041][tests] Fix
ReinterpretDataStreamAsKeyedStreamITCase
URL: https://github.com/apache/flink/pull/7753#discussion_r258460210
##########
File path:
flink-streaming-java/src/test/java/org/apache/flink/streaming/api/datastream/ReinterpretDataStreamAsKeyedStreamITCase.java
##########
@@ -207,33 +213,34 @@ public void close() throws Exception {
@Override
public void run(SourceContext<Tuple2<Integer, Integer>> out)
throws Exception {
- this.running = true;
- try {
- while (running) {
- checkFail();
+ running = true;
- synchronized (out.getCheckpointLock()) {
- Integer key = din.readInt();
- Integer val = din.readInt();
- out.collect(new Tuple2<>(key, val));
+ while (running && hasMoreDataToRead()) {
- position += 2 * Integer.BYTES;
- }
+ synchronized (out.getCheckpointLock()) {
+ position += 2 * Integer.BYTES;
Review comment:
I must be missing something. Let's take the original code:
```
checkFail();
synchronized (out.getCheckpointLock()) {
    Integer key = din.readInt();
    Integer val = din.readInt();
    out.collect(new Tuple2<>(key, val));
    position += 2 * Integer.BYTES;
}
```
Let's use a simple example: say we only write a single integer pair for each
subtask, i.e. each file partition has a size of `2 * Integer.BYTES`.
On the initial run, `position` is set to 0 after `open`/`initializeState`.
If we fail at the start of the first `run` loop iteration, we snapshot 0;
otherwise we read 2 integers and set `position` to `2 * Integer.BYTES`.
If we fail at the start of the second `run` loop iteration, we snapshot `2 *
Integer.BYTES`; otherwise we run into EOF and wait for the checkpoint to
complete, with `position` still set to `2 * Integer.BYTES` and snapshotted
accordingly.
When we restore, `position` is either 0, if we never read anything in
the initial run, or `2 * Integer.BYTES` in all other cases, in which case we
should always run into an EOF right away since `fileSize == 2 * Integer.BYTES`.
I can't find a case where we emit data without updating `position`, and that
is the condition for producing a duplicate.
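
To make the walkthrough above easy to check, here is a minimal stand-alone sketch of it. It is a simplified model, not the test source itself: `PositionReplaySketch`, `runOnce` and `countEmitted` are hypothetical names, the "file" is an in-memory byte array, and the simulated failure stands in for `checkFail()`. It assumes, as in the original code, that emitting a pair and advancing `position` happen atomically with respect to the snapshot (the checkpoint lock is only indicated by a comment).

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class PositionReplaySketch {

    /** Sentinel for "never fail" (hypothetical, for this sketch only). */
    static final int NO_FAILURE = Integer.MAX_VALUE;

    /**
     * Models one execution of the original run() loop, starting at restoredPosition.
     * Returns the value of position at the point where the run stops, i.e. what
     * a snapshot taken at that point would contain.
     */
    static long runOnce(byte[] file, long restoredPosition, int failBeforeIteration, List<int[]> out) {
        long position = restoredPosition;
        try (DataInputStream din = new DataInputStream(new ByteArrayInputStream(file))) {
            din.skipBytes((int) position); // resume where the snapshot left off
            int iteration = 0;
            while (true) {
                if (iteration == failBeforeIteration) {
                    return position; // simulated checkFail() at the start of the iteration
                }
                // In the original code, the next four steps run under the checkpoint lock.
                int key = din.readInt();
                int val = din.readInt();
                out.add(new int[] {key, val});
                position += 2 * Integer.BYTES;
                iteration++;
            }
        } catch (EOFException eof) {
            return position; // end of the partition; position stays where it was
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs once with a failure, restores from the snapshot, runs again; returns total emitted pairs. */
    static int countEmitted(int failBeforeIteration) {
        // One integer pair per partition, so fileSize == 2 * Integer.BYTES.
        byte[] file = ByteBuffer.allocate(2 * Integer.BYTES).putInt(7).putInt(42).array();
        List<int[]> emitted = new ArrayList<>();
        long snapshot = runOnce(file, 0L, failBeforeIteration, emitted);
        runOnce(file, snapshot, NO_FAILURE, emitted);
        return emitted.size();
    }

    public static void main(String[] args) {
        System.out.println("fail before 1st iteration: " + countEmitted(0));
        System.out.println("fail before 2nd iteration: " + countEmitted(1));
        System.out.println("no failure in first run:   " + countEmitted(NO_FAILURE));
    }
}
```

In this model every scenario emits the pair exactly once. A duplicate would only appear if a record could be emitted while the snapshot still carried the old `position`.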