[ https://issues.apache.org/jira/browse/FLINK-26388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498761#comment-17498761 ]

Matthias Pohl commented on FLINK-26388:
---------------------------------------

{{docker-compose.yml}} for setting up ZooKeeper and MinIO for the HA setup:
{code}
version: "3.9"
services:
  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    ports:
      - 9000:9000
      - 9001:9001
    volumes:
      - ./minio/data:/data
    env_file: minio.env
  zookeeper:
    image: docker.io/bitnami/zookeeper:3.5
    ports:
      - 2181:2181
    volumes:
      - ./zookeeper/data2:/bitnami
    environment:
      - ZOO_SERVER_ID=1
      - ALLOW_ANONYMOUS_LOGIN=yes
      - ZOO_SERVERS=zookeeper:2888:3888
{code}
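
The referenced {{minio.env}} is not included above; a minimal sketch (the root credentials are placeholders, pick your own) could look like this:
{code}
# minio.env - root credentials for the MinIO server (placeholder values)
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=minioadmin123
{code}
Both services can then be started with {{docker-compose up -d}}.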

Please find the Flink configuration adaptations for the HA setup below. The 
configuration assumes the following:
* a bucket {{cleanup-test}} created for general Flink HA-related artifacts
* a bucket {{jrs-store}} created for {{JRS}}-related artifacts
* a user {{cleanup-test}} with secret {{cleanup-test-123}} that has read/write 
access to both newly created buckets
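
One way to create the buckets and the user is through the MinIO client {{mc}} (just a sketch; the exact {{mc admin policy}} syntax varies between client versions, and {{readwrite}} is one of MinIO's built-in canned policies):
{code}
# point mc at the local MinIO instance using the root credentials from minio.env
mc alias set local http://localhost:9000 minioadmin minioadmin123

# create the two buckets
mc mb local/cleanup-test
mc mb local/jrs-store

# create the test user and grant it read/write access to everything
mc admin user add local cleanup-test cleanup-test-123
mc admin policy set local readwrite user=cleanup-test
{code}
The configuration adaptations themselves: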
{code}
high-availability: zookeeper
high-availability.zookeeper.quorum: localhost:2181

s3.endpoint: http://localhost:9000
s3.path.style.access: true
s3.access.key: cleanup-test
s3.secret.key: cleanup-test-123

high-availability.storageDir: s3://cleanup-test/

job-result-store.storage-path: s3://jrs-store/job-result-store
{code}
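Note that for the {{s3://}} scheme to be usable, one of Flink's S3 filesystem plugins has to be enabled before starting the cluster, e.g. (the jar version depends on your distribution):
{code}
mkdir -p ./plugins/s3-fs-presto
cp ./opt/flink-s3-fs-presto-*.jar ./plugins/s3-fs-presto/
./bin/start-cluster.sh
{code}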
You can now simulate a failure in S3 by changing the access permissions of the 
{{cleanup-test}} user on the {{cleanup-test}} bucket (while keeping the 
permissions for accessing the {{jrs-store}} bucket as-is).
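
One way to do that (again just a sketch, using the older {{mc admin policy}} syntax) is to replace the user's {{readwrite}} policy with a custom policy that only covers {{jrs-store}}:
{code}
cat > jrs-only.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:*"],
      "Resource": ["arn:aws:s3:::jrs-store", "arn:aws:s3:::jrs-store/*"]
    }
  ]
}
EOF
mc admin policy add local jrs-only jrs-only.json
mc admin policy set local jrs-only user=cleanup-test
{code}
Re-attaching the {{readwrite}} policy afterwards restores access and should let the retrying cleanup eventually succeed.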

> Release Testing: Repeatable Cleanup
> -----------------------------------
>
>                 Key: FLINK-26388
>                 URL: https://issues.apache.org/jira/browse/FLINK-26388
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Matthias Pohl
>            Priority: Major
>
> Repeatable cleanup was introduced with 
> [FLIP-194|https://issues.apache.org/jira/projects/FLINK/issues/FLINK-26284?filter=allopenissues]
>  but should be considered an independent feature of the {{JobResultStore}} 
> (JRS) from a user's point of view. The documentation efforts are finalized 
> with FLINK-26296.
> Repeatable cleanup can be triggered by running into an error during cleanup. 
> This can be achieved by disabling access to S3 after the job has finished, 
> e.g.:
> * Set a reasonably short checkpointing interval (checkpointing should be 
> enabled so that there are artifacts on S3 to clean up)
> * Disable S3 (by removing permissions or shutting down the S3 server)
> * Stop the job with a savepoint (see the sketch below)
> Stopping the job should work, but the logs should show the cleanup failing 
> with repeated retries. Enabling S3 again should fix the issue.
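> A minimal sketch of the relevant settings and the stop command (the job ID 
> and the savepoint path are placeholders):
> {code}
> # flink-conf.yaml: enable checkpointing with checkpoints stored on S3
> execution.checkpointing.interval: 10s
> state.checkpoints.dir: s3://cleanup-test/checkpoints
> 
> # after disabling S3, stop the job with a savepoint on local disk
> ./bin/flink stop --savepointPath file:///tmp/flink-savepoints <jobId>
> {code}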
> Keep in mind that if you test this with HA, you should use a different 
> bucket for the file-based JRS artifacts and only change permissions for the 
> bucket that holds the JRS-unrelated artifacts. Flink would fail fatally if the 
> JRS were not able to access its backend storage.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
