bentorb commented on code in PR #59042:
URL: https://github.com/apache/airflow/pull/59042#discussion_r2630619301


##########
providers/amazon/src/airflow/providers/amazon/aws/operators/s3.py:
##########
@@ -366,6 +366,159 @@ def get_openlineage_facets_on_start(self):
         )
 
 
+class S3CopyPrefixOperator(AwsBaseOperator[S3Hook]):
+    """
+    Creates a copy of all objects under a prefix already stored in S3.
+
+    Note: the S3 connection used here needs to have access to both
+    source and destination bucket/prefix.
+
+    .. seealso::
+        For more information on how to use this operator, take a look at the guide:
+        :ref:`howto/operator:S3CopyPrefixOperator`
+
+    :param source_bucket_prefix: The prefix in the source bucket. (templated)
+        It can be either a full s3:// style url or a relative path from the root level.
+        When it is specified as a full s3:// url, please omit source_bucket_name.
+    :param dest_bucket_prefix: The prefix in the destination to copy to. (templated)
+        The convention for specifying `dest_bucket_prefix` is the same as for `source_bucket_prefix`.
+    :param source_bucket_name: Name of the S3 bucket where the source objects are. (templated)
+        It should be omitted when `source_bucket_prefix` is provided as a full s3:// url.
+    :param dest_bucket_name: Name of the S3 bucket to which the objects are copied. (templated)
+        It should be omitted when `dest_bucket_prefix` is provided as a full s3:// url.
+    :param page_size: Number of objects to list per page when paginating through S3 objects.
+        Low values result in more API calls; high values increase memory usage.
+        Between 1 and 1000; setting it to 0 results in no objects being copied. Default is 1000.

Review Comment:
   My initial thought for allowing a value of 0 is that it can be used as a dynamic mechanism to 'disable' the task without having to modify the DAG. For instance, in the event of an incident with corrupted data, one could quickly stop copying files by setting this parameter to 0, without also having to change unit tests, etc. That is just one example; I can see other scenarios in which it might be useful.
   
   Furthermore, I feel that in this context a value of 0 is relatively self-explanatory. I also think `page_size` is not a 'trivial' parameter, and anyone providing a custom value (instead of using the default) would have at least a basic understanding of how to use it. That being said, I didn't know that 0 often represents "unbounded" in Airflow, so I can totally understand that confusion might arise.
   
   Personally, I don't have a strong opinion about this, so I'm happy to go with your approach.
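   
   Just to illustrate the scenario I had in mind, here is a rough, non-authoritative sketch. It assumes the `S3CopyPrefixOperator` interface proposed in this PR; the DAG id, bucket names and prefixes are made up:
   
   ```python
   import pendulum
   
   from airflow import DAG
   from airflow.providers.amazon.aws.operators.s3 import S3CopyPrefixOperator
   
   with DAG(
       dag_id="s3_copy_prefix_example",
       start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
       schedule=None,
   ) as dag:
       # Both prefixes are full s3:// style urls, so source_bucket_name and
       # dest_bucket_name are omitted.
       copy_reports = S3CopyPrefixOperator(
           task_id="copy_reports",
           source_bucket_prefix="s3://my-source-bucket/reports/2024/",
           dest_bucket_prefix="s3://my-dest-bucket/backups/reports/2024/",
           # 1000 is the default; flipping this to 0 would make the task copy
           # nothing, i.e. the "disable without editing the DAG" case above.
           page_size=1000,
       )
   ```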


