[jira] [Commented] (FLINK-26388) Release Testing: Repeatable Cleanup (FLINK-25433)

Matthias Pohl (Jira) Thu, 10 Mar 2022 04:11:21 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-26388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504217#comment-17504217
 ]


Matthias Pohl commented on FLINK-26388:
---------------------------------------

[~dwysakowicz] FLINK-26450 and FLINK-26484 should have fixed the issues around 
the S3 FileSystems (hadoop and presto). We could go ahead and do a final round 
of testing.

FLINK-26494 is currently blocked by another test failure; we are investigating 
whether it's related to the change (which it most likely is not). But I didn't 
go ahead with it since it's a usability feature (adding additional logs during 
a retry). So, I would say it's not necessary for the release testing to be 
completed.

> Release Testing: Repeatable Cleanup (FLINK-25433)
> -------------------------------------------------
>
>                 Key: FLINK-26388
>                 URL: https://issues.apache.org/jira/browse/FLINK-26388
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Matthias Pohl
>            Assignee: Dawid Wysakowicz
>            Priority: Blocker
>              Labels: release-testing
>             Fix For: 1.15.0
>
>
> Repeatable cleanup got introduced with 
> [FLIP-194|https://issues.apache.org/jira/projects/FLINK/issues/FLINK-26284?filter=allopenissues]
>  but should be considered as an independent feature of the {{JobResultStore}} 
> (JRS) from a user's point of view.
> Repeatable cleanup can be triggered by running into an error while cleaning 
> up. This can be achieved by disabling access to S3 after the job finished, 
> e.g.:
> * Set a reasonable checkpointing interval (checkpointing should be 
> enabled to allow cleanup of S3)
> * Disable S3 (by removing permissions or shutting down the S3 server)
> * Stop the job with a savepoint
> Stopping the job should work, but the logs should show the cleanup failing 
> with repeated retries. Enabling S3 again should fix the issue.
> Keep in mind that if you test this with HA, you should use a different 
> bucket for the file-based JRS artifacts and only change permissions for the 
> bucket that holds JRS-unrelated artifacts. Flink would fail fatally if the 
> JRS is not able to access its backend storage.
> Documentation and configuration are still in the process of being updated in 
> FLINK-26296 and FLINK-26331.
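
The behaviour the test steps in the description are meant to provoke (cleanup fails while S3 is unreachable, is retried, and succeeds once access is restored) can be illustrated with a small toy model. This is only a sketch and not Flink's implementation: `FlakyStore`, `cleanup_with_retries`, `run_demo`, and the object names are all hypothetical, and the real retry strategy and its log output may differ.

```python
class FlakyStore:
    """Hypothetical stand-in for an S3 bucket whose access can be switched off."""

    def __init__(self, objects):
        self.available = True
        self.objects = set(objects)

    def delete_all(self):
        # Mimics a cleanup call that fails while the bucket is unreachable.
        if not self.available:
            raise IOError("storage unavailable")
        self.objects.clear()


def cleanup_with_retries(store, max_attempts, after_failure=None):
    """Call store.delete_all() until it succeeds; return the attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            store.delete_all()
            return attempt
        except IOError as err:
            # Stands in for the repeated failure messages expected in the logs.
            print(f"cleanup attempt {attempt} failed: {err}")
            if after_failure is not None:
                after_failure(attempt)
    raise RuntimeError(f"cleanup still failing after {max_attempts} attempts")


def run_demo():
    store = FlakyStore({"chk-1", "chk-2", "savepoint-1"})
    store.available = False  # corresponds to disabling S3 in the test steps

    def restore_access(attempt):
        # Simulates the tester re-enabling S3 after the third failed attempt.
        if attempt == 3:
            store.available = True

    attempts = cleanup_with_retries(
        store, max_attempts=10, after_failure=restore_access
    )
    return attempts, store.objects
```

Calling `run_demo()` prints three failed attempts and then returns once the fourth attempt has emptied the store, which mirrors the expectation that enabling S3 again resolves the retries.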



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
