[
https://issues.apache.org/jira/browse/FLINK-14843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002915#comment-17002915
]
Kostas Kloudas commented on FLINK-14843:
----------------------------------------
The fact that the duplicates can be the result of non-truncated data from
before the failure sounds correct. {{Truncate}} is called upon recovery so we
have to wait for all tasks to have successfully recovered before counting the
valid records.
I think the solution is to have *no* size limit for the pending files and to
wait after recovery so that we have a high chance of having called
{{truncate()}} after recovery. The no size requirement is due to the fact that
pending files are never garbage collected, so if they are created and they do
not belong to any checkpoint, then they will always affect the number of
counted records.
> Streaming bucketing end-to-end test can fail with Output hash mismatch
> ----------------------------------------------------------------------
>
> Key: FLINK-14843
> URL: https://issues.apache.org/jira/browse/FLINK-14843
> Project: Flink
> Issue Type: Bug
> Components: Connectors / FileSystem, Tests
> Affects Versions: 1.10.0
> Environment: rev: dcc1330375826b779e4902176bb2473704dabb11
> Reporter: Gary Yao
> Assignee: PengFei Li
> Priority: Critical
> Labels: test-stability
> Fix For: 1.10.0
>
> Attachments: complete_result,
> flink-gary-standalonesession-0-gyao-desktop.log,
> flink-gary-taskexecutor-0-gyao-desktop.log,
> flink-gary-taskexecutor-1-gyao-desktop.log,
> flink-gary-taskexecutor-2-gyao-desktop.log,
> flink-gary-taskexecutor-3-gyao-desktop.log,
> flink-gary-taskexecutor-4-gyao-desktop.log,
> flink-gary-taskexecutor-5-gyao-desktop.log,
> flink-gary-taskexecutor-6-gyao-desktop.log
>
>
> *Description*
> Streaming bucketing end-to-end test ({{test_streaming_bucketing.sh}}) can
> fail with Output hash mismatch.
> {noformat}
> Number of running task managers has reached 4.
> Job (e0b7a86e4d4111f3947baa3d004e083a) is running.
> Waiting until all values have been produced
> Truncating buckets
> Number of produced values 26930/60000
> Truncating buckets
> Number of produced values 30890/60000
> Truncating buckets
> Number of produced values 37340/60000
> Truncating buckets
> Number of produced values 41290/60000
> Truncating buckets
> Number of produced values 46710/60000
> Truncating buckets
> Number of produced values 52120/60000
> Truncating buckets
> Number of produced values 57110/60000
> Truncating buckets
> Number of produced values 62530/60000
> Cancelling job e0b7a86e4d4111f3947baa3d004e083a.
> Cancelled job e0b7a86e4d4111f3947baa3d004e083a.
> Waiting for job (e0b7a86e4d4111f3947baa3d004e083a) to reach terminal state
> CANCELED ...
> Job (e0b7a86e4d4111f3947baa3d004e083a) reached terminal state CANCELED
> Job e0b7a86e4d4111f3947baa3d004e083a was cancelled, time to verify
> FAIL Bucketing Sink: Output hash mismatch. Got
> 9e00429abfb30eea4f459eb812b470ad, expected 01aba5ff77a0ef5e5cf6a727c248bdc3.
> head hexdump of actual:
> 0000000 ( 2 , 1 0 , 0 , S o m e p a y
> 0000010 l o a d . . . ) \n ( 2 , 1 0 , 1
> 0000020 , S o m e p a y l o a d . . .
> 0000030 ) \n ( 2 , 1 0 , 2 , S o m e p
> 0000040 a y l o a d . . . ) \n ( 2 , 1 0
> 0000050 , 3 , S o m e p a y l o a d .
> 0000060 . . ) \n ( 2 , 1 0 , 4 , S o m e
> 0000070 p a y l o a d . . . ) \n ( 2 ,
> 0000080 1 0 , 5 , S o m e p a y l o a
> 0000090 d . . . ) \n ( 2 , 1 0 , 6 , S o
> 00000a0 m e p a y l o a d . . . ) \n (
> 00000b0 2 , 1 0 , 7 , S o m e p a y l
> 00000c0 o a d . . . ) \n ( 2 , 1 0 , 8 ,
> 00000d0 S o m e p a y l o a d . . . )
> 00000e0 \n ( 2 , 1 0 , 9 , S o m e p a
> 00000f0 y l o a d . . . ) \n
> 00000fa
> Stopping taskexecutor daemon (pid: 55164) on host gyao-desktop.
> Stopping standalonesession daemon (pid: 51073) on host gyao-desktop.
> Stopping taskexecutor daemon (pid: 51504) on host gyao-desktop.
> Skipping taskexecutor daemon (pid: 52034), because it is not running anymore
> on gyao-desktop.
> Skipping taskexecutor daemon (pid: 52472), because it is not running anymore
> on gyao-desktop.
> Skipping taskexecutor daemon (pid: 52916), because it is not running anymore
> on gyao-desktop.
> Stopping taskexecutor daemon (pid: 54121) on host gyao-desktop.
> Stopping taskexecutor daemon (pid: 54726) on host gyao-desktop.
> [FAIL] Test script contains errors.
> Checking of logs skipped.
> [FAIL] 'flink-end-to-end-tests/test-scripts/test_streaming_bucketing.sh'
> failed after 2 minutes and 3 seconds! Test exited with exit code 1
> {noformat}
> *How to reproduce*
> Comment out the delay of 10s after the 1st TM is restarted to provoke the
> issue:
> {code:bash}
> echo "Restarting 1 TM"
> $FLINK_DIR/bin/taskmanager.sh start
> wait_for_number_of_running_tms 4
> #sleep 10
> echo "Killing 2 TMs"
> kill_random_taskmanager
> kill_random_taskmanager
> wait_for_number_of_running_tms 2
> {code}
> Command to run the test:
> {noformat}
> FLINK_DIR=build-target/ flink-end-to-end-tests/run-single-test.sh skip
> flink-end-to-end-tests/test-scripts/test_streaming_bucketing.sh
> {noformat}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)