zentol opened a new pull request #6478: [FLINK-9861][tests] Add StreamingFileSink E2E test
URL: https://github.com/apache/flink/pull/6478

## What is the purpose of the change

This PR adds an end-to-end test for the streaming file sink.

The source of the test program emits 60k elements over 60 seconds, after which it idles until the job is canceled. Records are `Tuple2<Integer, Integer>`, where `f0` is a key between 0 and 9 and `f1` is a sequence counter from 0 to 59999. The current sequence count is checkpointed.

The sink of the test program uses a custom encoder & bucketer, and rolls on each checkpoint. (Illustrative sketches of the source, sink wiring, REST polling, and result verification are included at the end of this description.)

Test behavior:
1. submit job
2. wait for some checkpoints
3. kill 1 TM
4. wait for restart
5. kill 2 TMs
6. wait for restart
7. wait until the sink no longer creates new files
8. cancel job
9. verify result

Step 2 ensures that some state is actually checkpointed, as otherwise the source could emit the entire output again. The current checkpoint count is retrieved through the REST API.

Steps 3-6 induce failures. We specifically wait for restarts to ensure that the job restarts twice. The current number of restarts is retrieved through the metrics REST API.

Steps 7-8 are a work-around for the behavior of checkpointing sinks on finite streams. For finite streams it usually happens that any data received between the last checkpoint and job termination is not committed. To avoid having to deal with this, we let the job run until the sinks are no longer creating new files for checkpoints. Files are only created if records are received, so if no sink creates a file we can infer that none of them has received a record, which should mean that there is in fact no more data to process. Strictly speaking this is just an assumption, but given the checkpoint interval of 5 seconds it is highly unlikely that no record makes it to the sink in time.

Step 9 verifies the result by concatenating and sorting all committed files, which should result in the sequence [0, 60000). If any data was lost by the sink the sequence will be interrupted. If any data was committed twice the sequence will contain a duplicate entry.

## Verifying this change

https://travis-ci.org/zentol/flink-ci/builds/411211048
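
A minimal sketch of a source along the lines described above: it emits `Tuple2<Integer, Integer>` records where `f0` is a key in [0, 9] and `f1` is a sequence counter, checkpoints the current counter via `ListCheckpointed`, and idles once all records are emitted. Class name, timing constants, and the key derivation are illustrative, not the actual test code.

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.checkpoint.ListCheckpointed;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

import java.util.Collections;
import java.util.List;

public class SequenceSource implements SourceFunction<Tuple2<Integer, Integer>>, ListCheckpointed<Integer> {

    private static final int TOTAL_RECORDS = 60_000;

    private volatile boolean running = true;
    // next sequence value to emit; restored from state after a restart
    private int index = 0;

    @Override
    public void run(SourceContext<Tuple2<Integer, Integer>> ctx) throws Exception {
        while (running && index < TOTAL_RECORDS) {
            synchronized (ctx.getCheckpointLock()) {
                // spread records over 10 keys; f1 is the sequence counter
                ctx.collect(Tuple2.of(index % 10, index));
                index++;
            }
            Thread.sleep(1); // roughly 60k records over 60 seconds
        }
        // idle until the job is canceled, so sinks keep committing on checkpoints
        while (running) {
            Thread.sleep(100);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public List<Integer> snapshotState(long checkpointId, long timestamp) {
        return Collections.singletonList(index);
    }

    @Override
    public void restoreState(List<Integer> state) {
        for (Integer value : state) {
            index = value;
        }
    }
}
```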
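
A sketch of how such a job could wire up a `StreamingFileSink` with a custom encoder and bucketer, rolling on every checkpoint. Note the method and class names here (`withBucketAssigner`, `OnCheckpointRollingPolicy`) follow the later BucketAssigner naming of the builder API and may differ from the exact Flink version this PR targets; the encoder and bucket layout are illustrative assumptions.

```java
import org.apache.flink.api.common.serialization.Encoder;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.fs.Path;
import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.OnCheckpointRollingPolicy;

import java.io.PrintStream;

public class StreamingFileSinkJob {

    public static void main(String[] args) throws Exception {
        final String outputPath = args.length > 0 ? args[0] : "/tmp/test-output";

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // checkpoint every 5 seconds, matching the interval mentioned in the description
        env.enableCheckpointing(5_000L);

        // custom encoder: write one sequence value (f1) per line
        Encoder<Tuple2<Integer, Integer>> encoder = (element, stream) -> {
            PrintStream out = new PrintStream(stream);
            out.println(element.f1);
            out.flush();
        };

        // custom bucketer: one bucket (directory) per key
        BucketAssigner<Tuple2<Integer, Integer>, String> bucketAssigner =
            new BucketAssigner<Tuple2<Integer, Integer>, String>() {
                @Override
                public String getBucketId(Tuple2<Integer, Integer> element, Context context) {
                    return "key-" + element.f0;
                }

                @Override
                public SimpleVersionedSerializer<String> getSerializer() {
                    return SimpleVersionedStringSerializer.INSTANCE;
                }
            };

        StreamingFileSink<Tuple2<Integer, Integer>> sink = StreamingFileSink
            .forRowFormat(new Path(outputPath), encoder)
            .withBucketAssigner(bucketAssigner)
            .withRollingPolicy(OnCheckpointRollingPolicy.build())
            .build();

        env.addSource(new SequenceSource())
            .addSink(sink);

        env.execute("StreamingFileSink end-to-end test");
    }
}
```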
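
For steps 2, 4, and 6 the test script polls the REST API. The actual test does this with shell tools from the e2e script; the sketch below is a purely illustrative Java equivalent. The endpoint paths (`/jobs/:jobid/checkpoints`, `/jobs/:jobid/metrics?get=fullRestarts`) follow the documented monitoring REST API of that era, but the base URL, metric name, and crude string parsing are assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RestPolling {

    private static final String BASE_URL = "http://localhost:8081"; // default REST port

    public static void main(String[] args) throws Exception {
        String jobId = args[0];

        // response contains e.g. {"counts":{"completed":3,...},...}
        String checkpoints = get(BASE_URL + "/jobs/" + jobId + "/checkpoints");
        System.out.println("completed checkpoints: " + extractNumber(checkpoints, "\"completed\":"));

        // response is e.g. [{"id":"fullRestarts","value":"2"}]
        String restarts = get(BASE_URL + "/jobs/" + jobId + "/metrics?get=fullRestarts");
        System.out.println("restart metric: " + restarts);
    }

    private static String get(String url) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new URL(url).openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line);
            }
        }
        return sb.toString();
    }

    private static String extractNumber(String json, String key) {
        int start = json.indexOf(key);
        if (start < 0) {
            return "unknown";
        }
        start += key.length();
        int end = start;
        while (end < json.length() && Character.isDigit(json.charAt(end))) {
            end++;
        }
        return json.substring(start, end);
    }
}
```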
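
Step 9 in the actual test concatenates and sorts the committed files with shell tools; the following Java sketch shows the same check for illustration. It reads every committed part file (skipping hidden in-progress/pending files), collects the sequence values, sorts them, and asserts that they form exactly the contiguous range [0, 60000). Paths and class name are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

public class VerifyOutput {

    public static void main(String[] args) throws IOException {
        Path outputDir = Paths.get(args.length > 0 ? args[0] : "/tmp/test-output");

        List<Integer> values = new ArrayList<>();
        try (Stream<Path> files = Files.walk(outputDir)) {
            files.filter(Files::isRegularFile)
                // committed files only; in-progress/pending files are hidden (dot-prefixed)
                .filter(p -> !p.getFileName().toString().startsWith("."))
                .forEach(p -> readValues(p, values));
        }

        Collections.sort(values);

        // a lost record shows up as a gap, a duplicate commit as a repeated value
        if (values.size() != 60_000) {
            throw new AssertionError("Expected 60000 records, found " + values.size());
        }
        for (int i = 0; i < values.size(); i++) {
            if (values.get(i) != i) {
                throw new AssertionError("Sequence broken at position " + i + ": " + values.get(i));
            }
        }
        System.out.println("Output verified: contiguous sequence [0, 60000).");
    }

    private static void readValues(Path file, List<Integer> values) {
        try {
            for (String line : Files.readAllLines(file)) {
                if (!line.isEmpty()) {
                    values.add(Integer.parseInt(line.trim()));
                }
            }
        } catch (IOException e) {
            throw new RuntimeException("Failed to read " + file, e);
        }
    }
}
```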
