moiseenkov commented on code in PR #31640:
URL: https://github.com/apache/airflow/pull/31640#discussion_r1214111474


##########
airflow/providers/amazon/aws/hooks/s3.py:
##########
@@ -425,7 +427,10 @@ def _is_in_period(input_date: datetime) -> bool:
 
         :return: a list of matched keys
         """
-        prefix = prefix or ""
+        _prefix = prefix or ""
+        wildcard_prefix = _prefix
+        if apply_wildcard and "*" in _prefix:
+            wildcard_prefix = _prefix.split("*", 1)[0]

Review Comment:
   I refactored the code a bit to make it more explicit. Coming back to your 
question, let's consider an example. Say the source bucket contains the 
following objects:
   ```
   my_file_a
   my_another_file_a
   my_file_b
   folder/
   folder/my_another_file_a
   folder/my_another_file_b
   ```
   If we want to get objects with the prefix `*a` and `apply_wildcard=True`, 
then only the following objects will be retrieved:
   ```
   my_file_a
   my_another_file_a
   folder/my_another_file_a
   ```
   Because `list_objects_v2` doesn't support wildcards, we have to handle them 
manually. Take another mask, `my*a`. Only two objects match it:
   ```
   my_file_a
   my_another_file_a
   ```
   To achieve this result we split the prefix at the first `*` and keep the 
left part, because all expected objects share this "fixed" part of the prefix. 
In our example it is the "subprefix" `my`. We pass it to `list_objects_v2` and 
get the list of candidates:
   ```
   my_file_a
   my_another_file_a
   my_file_b
   ```
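   The split itself can be sketched in isolation. `fixed_prefix` below is a 
hypothetical helper written only for illustration, not a function from the 
hook; it mirrors the `split("*", 1)[0]` call in the diff above:

```python
def fixed_prefix(mask: str) -> str:
    """Return the part of the mask before the first `*` (hypothetical helper)."""
    # maxsplit=1 means only the first `*` matters; a mask without `*`
    # comes back unchanged.
    return mask.split("*", 1)[0]


print(repr(fixed_prefix("my*a")))       # 'my'
print(repr(fixed_prefix("*a")))         # ''
print(repr(fixed_prefix("my_file_a")))  # 'my_file_a'
```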
   At the next step we iterate over this list and check whether each object's 
key matches the original mask `my*a`:
   ```python
   fnmatch.fnmatch(k["Key"], _original_prefix)
   
   # fnmatch.fnmatch("my_file_a", "my*a") == True
   # fnmatch.fnmatch("my_another_file_a", "my*a") == True
   # fnmatch.fnmatch("my_file_b", "my*a") == False
   ```
   The last candidate is filtered out, so we get only the objects we need.
   In your example the prefix is `*a`, so the "fixed" part of the prefix is the 
empty string `""`, and at the first step of the algorithm `list_objects_v2` 
retrieves all objects. At the second step we apply the original mask and filter 
out all the "wrong" objects. This is the worst case, but unfortunately there is 
no way to handle it more efficiently short of implementing wildcard support 
inside the AWS SDK.
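   To make the two steps concrete, here is a minimal sketch that runs against 
an in-memory list instead of a real bucket. `list_keys_with_prefix` and 
`match_keys` are hypothetical names used only for this illustration; the first 
one merely imitates a prefix-only listing such as `list_objects_v2`, and the 
sketch uses `fnmatch.fnmatchcase` so that the result is case-sensitive on every 
platform:

```python
import fnmatch

# Object keys from the example above; an in-memory stand-in for the bucket.
BUCKET_KEYS = [
    "my_file_a",
    "my_another_file_a",
    "my_file_b",
    "folder/",
    "folder/my_another_file_a",
    "folder/my_another_file_b",
]


def list_keys_with_prefix(prefix: str) -> list:
    """Hypothetical stand-in for a listing call that only understands prefixes."""
    return [key for key in BUCKET_KEYS if key.startswith(prefix)]


def match_keys(mask: str) -> list:
    """Step 1: list by the fixed part of the mask. Step 2: filter by the full mask."""
    candidates = list_keys_with_prefix(mask.split("*", 1)[0])
    return [key for key in candidates if fnmatch.fnmatchcase(key, mask)]


print(match_keys("my*a"))  # ['my_file_a', 'my_another_file_a']
print(match_keys("*a"))    # ['my_file_a', 'my_another_file_a', 'folder/my_another_file_a']
```

   With the mask `*a` the first step returns all six keys, which is the worst 
case described above: every key has to be listed before it can be filtered.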



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
