bentorb commented on code in PR #59042:
URL: https://github.com/apache/airflow/pull/59042#discussion_r2630766376
##########
providers/amazon/src/airflow/providers/amazon/aws/operators/s3.py:
##########
@@ -366,6 +366,159 @@ def get_openlineage_facets_on_start(self):
)
+class S3CopyPrefixOperator(AwsBaseOperator[S3Hook]):
+ """
+ Creates a copy of all objects under a prefix already stored in S3.
+
+ Note: the S3 connection used here needs to have access to both
+ source and destination bucket/prefix.
+
+ .. seealso::
+ For more information on how to use this operator, take a look at the guide:
+ :ref:`howto/operator:S3CopyPrefixOperator`
+
+ :param source_bucket_prefix: The prefix in the source bucket. (templated)
+ It can be either full s3:// style url or relative path from root level.
+ When it's specified as a full s3:// url, please omit source_bucket_name.
+ :param dest_bucket_prefix: The prefix in the destination to copy to. (templated)
+ The convention to specify `dest_bucket_prefix` is the same as `source_bucket_prefix`.
+ :param source_bucket_name: Name of the S3 bucket where the source objects are in. (templated)
+ It should be omitted when `source_bucket_prefix` is provided as a full s3:// url.
+ :param dest_bucket_name: Name of the S3 bucket to where the objects are copied. (templated)
+ It should be omitted when `dest_bucket_prefix` is provided as a full s3:// url.
+ :param page_size: Number of objects to list per page when paginating through S3 objects.
Review Comment:
This is a very good question/point. The maximum length of an S3 object key
is 1,024 bytes (UTF-8). At the same time, the default and maximum values
for `page_size` are both 1000. This means that, in theory, the object keys
retrieved by a single `list_objects_v2` API call should fit in roughly 1 MB,
which nowadays is a very small memory footprint.
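As a rough back-of-the-envelope check of that estimate, here is a minimal sketch. It counts only key bytes and ignores the other per-object fields in the response (ETag, size, timestamps), and the helper name `max_key_bytes_per_page` is purely illustrative, not part of the PR:

```python
# Worst-case key bytes returned by one list_objects_v2 page (keys only).
MAX_KEY_BYTES = 1024  # S3 limit on object key length, in UTF-8 bytes
MAX_PAGE_SIZE = 1000  # default and maximum number of keys per call


def max_key_bytes_per_page(page_size: int = MAX_PAGE_SIZE) -> int:
    """Return an upper bound on the total key bytes in a single page."""
    if page_size < 0:
        raise ValueError("page_size must be non-negative")
    # Requests above the maximum are capped at MAX_PAGE_SIZE keys.
    return min(page_size, MAX_PAGE_SIZE) * MAX_KEY_BYTES


print(max_key_bytes_per_page())  # 1024000 bytes, just under 1 MiB
```

So even in the worst case, lowering `page_size` saves well under 1 MiB of key data per page.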
The main reason I decided to include this parameter is for the special case
when it is equal to 0, which results in no data being copied. However, as we
are discussing in the next comment, we might not want to support that. In that
case, we could just remove the parameter altogether.