[ https://issues.apache.org/jira/browse/FLINK-26388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498761#comment-17498761 ]
Matthias Pohl edited comment on FLINK-26388 at 2/28/22, 8:25 AM:
-----------------------------------------------------------------
{{docker-compose.yml}} for setting up ZK and Minio for HA setup:
{code}
version: "3.9"
services:
  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    ports:
      - 9000:9000
      - 9001:9001
    volumes:
      - ./minio/data:/data
    env_file: minio.env
  zookeeper:
    image: docker.io/bitnami/zookeeper:3.5
    ports:
      - 2181:2181
    volumes:
      - ./zookeeper/data2:/bitnami
    environment:
      - ZOO_SERVER_ID=1
      - ALLOW_ANONYMOUS_LOGIN=yes
      - ZOO_SERVERS=zookeeper:2888:3888
{code}
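The {{minio.env}} file referenced above is not part of the comment. A minimal
sketch, assuming it only provides the MinIO root credentials (the values below
are placeholders):
{code}
# minio.env - root credentials for the MinIO server (placeholder values)
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=minioadmin
{code}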
Please find the Flink configuration adaptations for the HA setup below. The
configuration assumes the following:
* A {{cleanup-test}} bucket created for general Flink HA-related artifacts
* A {{jrs-store}} bucket created for {{JRS}}-related artifacts
* A user {{cleanup-test}} with secret {{cleanup-test-123}} that has read/write
access to both newly created buckets (see the {{mc}} sketch after this list)
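One way to create these prerequisites is the MinIO client {{mc}}. The following
is only a sketch under the assumptions above; the policy name
{{cleanup-test-policy}} and the {{policy.json}} file are made up for the
example, and the {{mc admin policy}} subcommands differ between {{mc}} versions:
{code}
# point mc at the local MinIO instance (root credentials from minio.env)
mc alias set local http://localhost:9000 minioadmin minioadmin

# create the two buckets
mc mb local/cleanup-test
mc mb local/jrs-store

# create the user and attach a policy granting read/write on both buckets
mc admin user add local cleanup-test cleanup-test-123
mc admin policy add local cleanup-test-policy policy.json
mc admin policy set local cleanup-test-policy user=cleanup-test
{code}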
{code}
high-availability: zookeeper
high-availability.zookeeper.quorum: localhost:2181
s3.endpoint: http://localhost:9000
s3.path.style.access: true
s3.access.key: cleanup-test
s3.secret.key: cleanup-test-123
high-availability.storageDir: s3://cleanup-test/
job-result-store.storage-path: s3://jrs-store/job-result-store
{code}
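Note that the {{s3://}} scheme requires one of Flink's S3 filesystem plugins to
be enabled. A sketch assuming the Presto-based plugin of a 1.15 distribution
(adjust the jar version to your distribution):
{code}
# run from the root directory of the Flink distribution
mkdir -p ./plugins/s3-fs-presto
cp ./opt/flink-s3-fs-presto-1.15.0.jar ./plugins/s3-fs-presto/
{code}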
You can now simulate an S3 failure by changing the access permissions of the
{{cleanup-test}} user on the {{cleanup-test}} bucket (while keeping the
permissions for accessing the {{jrs-store}} bucket as is), as sketched below.
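For example, the user's policy could be replaced with one that only grants
access to {{jrs-store}}, which effectively revokes access to the
{{cleanup-test}} bucket. A sketch of such a policy document (the policy itself
is an assumption, not part of the original setup):
{code}
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:*"],
      "Resource": [
        "arn:aws:s3:::jrs-store",
        "arn:aws:s3:::jrs-store/*"
      ]
    }
  ]
}
{code}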
> Release Testing: Repeatable Cleanup
> -----------------------------------
>
> Key: FLINK-26388
> URL: https://issues.apache.org/jira/browse/FLINK-26388
> Project: Flink
> Issue Type: New Feature
> Components: Runtime / Coordination
> Affects Versions: 1.15.0
> Reporter: Matthias Pohl
> Priority: Major
>
> Repeatable cleanup was introduced with
> [FLIP-194|https://issues.apache.org/jira/projects/FLINK/issues/FLINK-26284?filter=allopenissues]
> but should be considered an independent feature of the {{JobResultStore}}
> (JRS) from a user's point of view. The documentation efforts are finalized
> with FLINK-26296.
> Repeatable cleanup can be triggered by running into an error while cleaning
> up. This can be achieved by disabling access to S3 after the job has
> finished, e.g.:
> * Set a reasonable checkpointing interval (checkpointing should be enabled
> so that there are artifacts on S3 to clean up)
> * Disable S3 (remove permissions or shut down the S3 server)
> * Stop the job with a savepoint
> Stopping the job should work, but the logs should show the cleanup failing
> with repeated retries. Enabling S3 again should fix the issue.
> Keep in mind that, if testing this with HA, you should use a separate bucket
> for the file-based JRS artifacts and only change permissions for the bucket
> that holds the JRS-unrelated artifacts. Flink would fail fatally if the JRS
> is not able to access its backend storage.
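A minimal sketch of the testing steps from the description above; the
checkpointing interval, paths, and job id are placeholders, only the option
names and the CLI command are taken from Flink:
{code}
# flink-conf.yaml: enable checkpointing against S3 (placeholder interval/path)
execution.checkpointing.interval: 10s
state.checkpoints.dir: s3://cleanup-test/checkpoints

# after disabling S3 access, stop the job with a savepoint (placeholder job id)
./bin/flink stop --savepointPath /tmp/flink-savepoints <jobId>
{code}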
--
This message was sent by Atlassian Jira
(v8.20.1#820001)