[
https://issues.apache.org/jira/browse/FLINK-9861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16566710#comment-16566710
]
ASF GitHub Bot commented on FLINK-9861:
---------------------------------------
zentol opened a new pull request #6478: [FLINK-9861][tests] Add
StreamingFileSink E2E test
URL: https://github.com/apache/flink/pull/6478
## What is the purpose of the change
This PR adds an end-to-end test for the streaming file sink.
The source of the test program emits 60k elements over 60 seconds, after
which it idles until the job is canceled. Records are `Tuple2<Integer,
Integer>`, where `f0` is a key between 0 and 9 and `f1` is a sequence counter from
0 to 59999. The current sequence count is checkpointed.
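For illustration, such a source could look roughly like the sketch below (this is not the PR's actual code; the class name, the 1 ms pacing, and the state name are assumptions). The sequence counter lives in operator state so that a restart resumes emission instead of starting over from 0:
```java
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

/** Illustrative source: 60k (key, counter) tuples over roughly 60 s, then idle. */
public class SequenceSource
        implements SourceFunction<Tuple2<Integer, Integer>>, CheckpointedFunction {

    private volatile boolean running = true;
    private int counter; // restored from state on recovery
    private transient ListState<Integer> counterState;

    @Override
    public void run(SourceContext<Tuple2<Integer, Integer>> ctx) throws Exception {
        while (running && counter < 60_000) {
            // emit and advance the counter under the checkpoint lock so the
            // checkpointed count matches what was actually emitted
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(Tuple2.of(counter % 10, counter));
                counter++;
            }
            Thread.sleep(1); // at most ~1000 records/second -> ~60 seconds total
        }
        while (running) {
            Thread.sleep(100); // idle until the job is canceled
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        counterState.clear();
        counterState.add(counter);
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        counterState = context.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("counter", Integer.class));
        for (Integer restored : counterState.get()) {
            counter = restored;
        }
    }
}
```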
The sink of the test program uses a custom encoder and bucketer, and rolls on
each checkpoint.
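A minimal sketch of a row-format `StreamingFileSink` with a custom `Encoder` is shown below. The output path is hypothetical, and the PR's custom bucketer and roll-on-checkpoint policy are omitted, since the corresponding builder hooks (bucketer/bucket-assigner and rolling-policy setters) vary between Flink releases:
```java
import org.apache.flink.api.common.serialization.Encoder;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

import java.nio.charset.StandardCharsets;

// Writes each record as a "key,counter" line into the bucket's part files.
StreamingFileSink<Tuple2<Integer, Integer>> sink = StreamingFileSink
        .forRowFormat(
                new Path("/tmp/test-output"), // hypothetical output path
                (Encoder<Tuple2<Integer, Integer>>) (element, stream) ->
                        stream.write((element.f0 + "," + element.f1 + "\n")
                                .getBytes(StandardCharsets.UTF_8)))
        .build();
```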
Test behavior:
1. submit job
2. wait for some checkpoints
3. kill 1 TM
4. wait for restart
5. kill 2 TMs
6. wait for restart
7. wait until the sink no longer creates new files
8. cancel job
9. verify result
Step 2 ensures that some state is actually checkpointed, as otherwise the
source could emit the entire output again. The current checkpoint count is
retrieved through the REST API.
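As a hedged sketch, the completed-checkpoint count can be polled from the `/jobs/<jobID>/checkpoints` REST endpoint; the helper name, default endpoint, and regex-based JSON handling below are illustrative, not the PR's actual script:
```java
import java.io.IOException;
import java.net.URL;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Returns the number of completed checkpoints reported by the REST API. */
static int completedCheckpoints(String restEndpoint, String jobId) throws IOException {
    URL url = new URL(restEndpoint + "/jobs/" + jobId + "/checkpoints");
    try (Scanner scanner = new Scanner(url.openStream(), "UTF-8").useDelimiter("\\A")) {
        String json = scanner.hasNext() ? scanner.next() : "";
        // naive extraction of counts.completed; a real test would use a JSON parser
        Matcher m = Pattern.compile("\"completed\"\\s*:\\s*(\\d+)").matcher(json);
        return m.find() ? Integer.parseInt(m.group(1)) : 0;
    }
}
```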
Steps 3-6 induce failures. We explicitly wait for the restarts to ensure that
the job in fact restarts twice. The current number of restarts is retrieved
through the metrics REST API.
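Analogously, the restart count is exposed as the `fullRestarts` job metric in this era of Flink; an illustrative query (again, the helper is hypothetical) could look like:
```java
import java.io.IOException;
import java.net.URL;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Reads the job's fullRestarts metric through the metrics REST API. */
static int fullRestarts(String restEndpoint, String jobId) throws IOException {
    URL url = new URL(restEndpoint + "/jobs/" + jobId + "/metrics?get=fullRestarts");
    try (Scanner scanner = new Scanner(url.openStream(), "UTF-8").useDelimiter("\\A")) {
        String json = scanner.hasNext() ? scanner.next() : "";
        // the endpoint answers with [{"id":"fullRestarts","value":"<n>"}]
        Matcher m = Pattern.compile("\"value\"\\s*:\\s*\"(\\d+)\"").matcher(json);
        return m.find() ? Integer.parseInt(m.group(1)) : 0;
    }
}
```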
Steps 7-8 are a work-around for the behavior of checkpointing sinks on finite
streams: any data received between the last checkpoint and job termination is
usually not committed. To avoid having to deal with this, we let the job run
until the sinks are no longer creating new files for checkpoints. Files are
only created when records are received, so if no sink creates a file we can
infer that none of them received a record, which should mean there is in fact
no more data to process. Strictly speaking this is just an assumption, but
given the checkpoint interval of 5 seconds it is highly unlikely that no record
makes it to a sink in time.
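One illustrative way to implement the "no new files" check (the helper name, local-filesystem assumption, and 10 s polling interval are assumptions, not the PR's script):
```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Blocks until no new part file has appeared for one polling interval. */
static void waitForInactiveSink(Path outputDir) throws Exception {
    long previousCount = -1;
    while (true) {
        long currentCount;
        try (Stream<Path> files = Files.walk(outputDir)) {
            currentCount = files.filter(Files::isRegularFile).count();
        }
        if (currentCount == previousCount) {
            return; // no file was created during the last interval
        }
        previousCount = currentCount;
        Thread.sleep(10_000); // comfortably longer than the 5 s checkpoint interval
    }
}
```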
Step 9 verifies the result by concatenating and sorting all committed files,
which should yield the sequence [0, 60000).
If any data was lost by the sink, the sequence will be interrupted; if any
data was committed twice, the sequence will contain a duplicate entry.
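A sketch of that verification, assuming the "key,counter" line format from the encoder example above (the helper name is hypothetical):
```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Collects all counters from the committed part files and checks [0, 60000). */
static void verifyResult(Path outputDir) throws IOException {
    List<Path> partFiles;
    try (Stream<Path> files = Files.walk(outputDir)) {
        partFiles = files.filter(Files::isRegularFile).collect(Collectors.toList());
    }
    List<Integer> counters = new ArrayList<>();
    for (Path file : partFiles) {
        for (String line : Files.readAllLines(file)) {
            counters.add(Integer.parseInt(line.split(",")[1]));
        }
    }
    Collections.sort(counters);
    if (counters.size() != 60_000) {
        throw new AssertionError("Expected 60000 records, found " + counters.size());
    }
    for (int i = 0; i < counters.size(); i++) {
        if (counters.get(i) != i) { // gap => data loss, repeat => double commit
            throw new AssertionError("Sequence broken at position " + i);
        }
    }
}
```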
## Verifying this change
https://travis-ci.org/zentol/flink-ci/builds/411211048
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Add end-to-end test for reworked BucketingSink
> ----------------------------------------------
>
> Key: FLINK-9861
> URL: https://issues.apache.org/jira/browse/FLINK-9861
> Project: Flink
> Issue Type: Sub-task
> Components: Streaming Connectors
> Affects Versions: 1.6.0
> Reporter: Till Rohrmann
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.6.0
>
>
> We should add an end-to-end test for the reworked BucketingSink to verify that
> the sink works with different {{FileSystems}}.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)