[
https://issues.apache.org/jira/browse/FLINK-19589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Padarn Wilson updated FLINK-19589:
----------------------------------
Description:
This ticket proposes to expose the management of two properties related to S3
object management:
- [Lifecycle configuration
|https://docs.aws.amazon.com/AmazonS3/latest/dev/intro-lifecycle-rules.html]
- [Object
tagging|https://docs.aws.amazon.com/AmazonS3/latest/dev/object-tagging.html]
Being able to control these is useful for people who run jobs that use S3 for
checkpointing or job output and who need per-job configuration of the
tagging/lifecycle for the purposes of auditing or cost control (for example,
deleting old state from S3).
Ideally, it would be possible to control this on each object being written by
Flink, or at least at a job level.
_Note_: Some related existing properties can already be set on the hadoop
module via system properties; see for example
fs.s3a.acl.default
which sets the default ACL on written objects.
*Solutions*:
1) Modify hadoop module:
The hadoop module could be updated to add a new property (and a similar one
for lifecycle)
fs.s3a.tags.default
which could be a comma-separated list of tags to set. For example
fs.s3a.tags.default = "jobname:JOBNAME,owner:OWNER"
This seems like a natural place to put this logic (and it is outside of Flink,
if we decide to go this way). However, it does not allow a sink and a
checkpoint to have different values for these.
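As a rough illustration of what parsing such a property might involve, here is
a minimal sketch in plain Java. The class and method names (S3TagConfig,
parseTags) are hypothetical, and the "key:value" syntax is taken only from the
example above; fs.s3a.tags.default is the property proposed in this ticket and
is assumed not to exist yet.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: not part of Flink or Hadoop.
final class S3TagConfig {

    private S3TagConfig() {}

    /**
     * Parses a value such as "jobname:JOBNAME,owner:OWNER" into an ordered
     * map of tag keys to tag values. The syntax mirrors the example in this
     * ticket and is an assumption, not an established format.
     */
    static Map<String, String> parseTags(String raw) {
        Map<String, String> tags = new LinkedHashMap<>();
        if (raw == null || raw.trim().isEmpty()) {
            return tags; // no tags configured
        }
        for (String entry : raw.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate stray commas
            }
            int sep = trimmed.indexOf(':');
            if (sep <= 0) {
                throw new IllegalArgumentException(
                        "Expected key:value but got '" + trimmed + "'");
            }
            // Split on the first ':' only, so values may contain colons.
            tags.put(trimmed.substring(0, sep).trim(),
                     trimmed.substring(sep + 1).trim());
        }
        return tags;
    }
}
```

The resulting map would still need to be validated against S3's tagging
limits before being attached to a request.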
2) Expose withTagging from module
The hadoop module used by Flink's existing filesystem already exposes
put-request-level tagging (see
[this|https://github.com/aws/aws-sdk-java/blob/c06822732612d7208927d2a678073098522085c3/aws-java-sdk-s3/src/main/java/com/amazonaws/services/s3/model/PutObjectRequest.java#L292]).
This could be used in the Flink filesystem plugin to expose these options. A
possible approach could be to somehow incorporate it into the file path, e.g.,
path = "TAGS:s3://bucket/path"
Or possibly as an option that can be applied to the checkpoint and sink
configurations, e.g.,
env.getCheckpointConfig().setS3Tags(TAGS)
and similar for a file sink.
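To make the path-based variant more concrete, the sketch below splits an
optional tag prefix from an S3 URI. Everything here is an assumption for
illustration: the class name TaggedS3Path is hypothetical, and the exact prefix
syntax (raw tag text, then ':', then the URI) is only one possible reading of
the "TAGS:s3://bucket/path" example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of Flink's filesystem plugins.
final class TaggedS3Path {

    // Optional "<tags>:" prefix followed by an s3://, s3a:// or s3p:// URI.
    private static final Pattern TAGGED =
            Pattern.compile("^(?:(.*?):)?(s3[ap]?://.+)$");

    final String tags; // raw tag prefix, or "" when none was given
    final String path; // the plain S3 URI

    private TaggedS3Path(String tags, String path) {
        this.tags = tags;
        this.path = path;
    }

    static TaggedS3Path parse(String value) {
        Matcher m = TAGGED.matcher(value);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not an S3 path: " + value);
        }
        String prefix = m.group(1); // null when the prefix is absent
        return new TaggedS3Path(prefix == null ? "" : prefix, m.group(2));
    }
}
```

For example, TaggedS3Path.parse("jobname:JOBNAME:s3://bucket/path") would
yield the tag text "jobname:JOBNAME" and the path "s3://bucket/path". This
keeps existing APIs unchanged, at the cost of overloading the meaning of the
path string, which is one reason the explicit configuration option may be
preferable.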
_Note_: The lifecycle can also be managed using the module: see
[here|https://docs.aws.amazon.com/AmazonS3/latest/dev/manage-lifecycle-using-java.html].
> Expose S3 options for tagging and object lifecycle policy for FileSystem
> ------------------------------------------------------------------------
>
> Key: FLINK-19589
> URL: https://issues.apache.org/jira/browse/FLINK-19589
> Project: Flink
> Issue Type: Improvement
> Components: FileSystems
> Affects Versions: 1.12.0
> Reporter: Padarn Wilson
> Assignee: Padarn Wilson
> Priority: Minor
--
This message was sent by Atlassian Jira
(v8.3.4#803005)