[ 
https://issues.apache.org/jira/browse/FLINK-26391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17499874#comment-17499874
 ] 

Yang Wang edited comment on FLINK-26391 at 3/14/22, 2:11 AM:
-------------------------------------------------------------

Tested this ticket on YARN with the following steps.

1. [PASS] Submit a Flink application with a specified high-availability cluster-id
{code:java}
./bin/flink run-application -t yarn-application -d \
  -Dhigh-availability.cluster-id=test-job-result-store \
  -Dhigh-availability=ZooKeeper \
  -Dhigh-availability.zookeeper.quorum=i22xxxx:12181 \
  -Dhigh-availability.storageDir=hdfs://flinkdev/tmp/flink-ha-yiqi \
  -Djob-result-store.delete-on-commit=true \
  examples/streaming/StateMachineExample.jar
{code}
2. [PASS] Wait for the job to be running and cancel it via the web UI
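As a side note (my own addition, not part of the original steps), the cancellation could also be triggered from the CLI. The YARN application id below is a placeholder; the job id is the fixed HA job id that also appears in the JRS file name in step 3:
{code:java}
# Hypothetical CLI alternative to cancelling via the web UI. Replace the YARN
# application id with the one printed during submission.
./bin/flink cancel -t yarn-application \
  -Dyarn.application.id=application_XXXX_YY \
  00000000000000000000000000000000
{code}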
3. [PASS] Verify the job result store file at 
{{hdfs://flinkdev/tmp/flink-ha-yiqi/job-result-store/test-job-result-store/00000000000000000000000000000000.json}}.
The content is as follows:
{code:java}
{"result":{"id":"00000000000000000000000000000000","application-status":"CANCELED","accumulator-results":{},"net-runtime":74563},"version":1}
{code}
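For reference, the committed entry can be inspected directly with the HDFS CLI (path taken from step 3):
{code:java}
# Print the committed job result store entry (path from step 3).
hdfs dfs -cat hdfs://flinkdev/tmp/flink-ha-yiqi/job-result-store/test-job-result-store/00000000000000000000000000000000.json
{code}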
4. [PASS] Rename the generated job result store file from {{<job-id>.json}} to {{<job-id>_DIRTY.json}} to mark it as dirty
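A sketch of the rename with the HDFS CLI, using the {{_DIRTY}} suffix convention described in the issue and the path from step 3:
{code:java}
# Mark the committed result as dirty by renaming <job-id>.json to
# <job-id>_DIRTY.json (suffix convention from the issue description).
JRS_DIR=hdfs://flinkdev/tmp/flink-ha-yiqi/job-result-store/test-job-result-store
hdfs dfs -mv \
  ${JRS_DIR}/00000000000000000000000000000000.json \
  ${JRS_DIR}/00000000000000000000000000000000_DIRTY.json
{code}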
5. [PASS] Use the command from step 1 to submit a Flink application again
6. [PASS] Verify that the job does not run again and the application finishes directly. 
Also verify that the dirty job result store entry has been marked clean.
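One way to check this (my own addition, not part of the original steps) is to list the job result store directory and confirm that the {{_DIRTY}} suffix is gone:
{code:java}
# After the recovery run, the entry should be present again without the _DIRTY suffix.
hdfs dfs -ls hdfs://flinkdev/tmp/flink-ha-yiqi/job-result-store/test-job-result-store/
{code}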
7. [PASS] Start the Flink application again using the command from step 1 with 
{{job-result-store.delete-on-commit=true}}
8. [{color:#ff0000}NOT PASS{color}] Verify that the clean job result store file has 
been deleted

 

cc [~mapohl] I am not sure whether step 8 shows the expected behavior, because 
when I start a new Flink application with delete-on-commit=true from the very 
beginning, there will be no retained job result store file.

 

 

Update:

Step 8 in the above comment is not a bug. Because I set 
{{job-result-store.delete-on-commit=true}} in the initial run, the users need 
to clean up the job result store file manually.
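If needed, the manual cleanup could look like this (HDFS CLI, path taken from step 3):
{code:java}
# Manually remove the retained job result store entry (path from step 3).
hdfs dfs -rm hdfs://flinkdev/tmp/flink-ha-yiqi/job-result-store/test-job-result-store/00000000000000000000000000000000.json
{code}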



> Release Testing: Application Mode recovery does not re-trigger a job which 
> failed during cleanup (FLINK-11813)
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-26391
>                 URL: https://issues.apache.org/jira/browse/FLINK-26391
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Matthias Pohl
>            Assignee: Yang Wang
>            Priority: Blocker
>              Labels: release-testing
>             Fix For: 1.15.0
>
>
> FLINK-11813 is about not being able to determine whether a job has been 
> terminated globally before a failover happened. Testing this behavior can be 
> achieved by running a job in HA mode to enable the file-based 
> {{JobResultStore}} (JRS).
> You can specify 
> [job-result-store.storage-path|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#job-result-store-storage-path]
>  to point to a directory which you can access. 
> [job-result-store.delete-on-commit|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#job-result-store-delete-on-commit]
>  can be used to prevent the JRS artifacts from being deleted after a job has finished.
> You can make a job finish to generate the JRS artifact for this job in the 
> specified directory. Renaming the generated file from {{<job-id>.json}} to 
> {{<job-id>_DIRTY.json}} will simulate the job not being cleaned up properly. 
> Starting the job in application mode once more (by specifying the 
> corresponding Job ID) should lead to the job not being started again (you 
> might want to enable {{debug}} logging to verify the logs), i.e.:
> * Cleanup should be performed. 
> * No JobMaster-related logs should appear in the Flink logs.
> * Cleanup-related logs should appear in the Flink logs.
> * At the end, the {{_DIRTY.json}} file extension should have been removed 
> from the JRS artifact again.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
