moiseenkov commented on code in PR #31640:
URL: https://github.com/apache/airflow/pull/31640#discussion_r1214111474
##########
airflow/providers/amazon/aws/hooks/s3.py:
##########
@@ -425,7 +427,10 @@ def _is_in_period(input_date: datetime) -> bool:
:return: a list of matched keys
"""
- prefix = prefix or ""
+ _prefix = prefix or ""
+ wildcard_prefix = _prefix
+ if apply_wildcard and "*" in _prefix:
+ wildcard_prefix = _prefix.split("*", 1)[0]
Review Comment:
I refactored the code a bit to make it more explicit. Coming back to your
question, let's consider an example. Suppose the source bucket contains the
following objects:
```
my_file_a
my_another_file_a
my_file_b
folder/
folder/my_another_file_a
folder/my_another_file_b
```
If we request objects with the prefix `*a` and `apply_wildcard=True`, only the
following objects are returned:
```
my_file_a
my_another_file_a
folder/my_another_file_a
```
Because `list_objects_v2` doesn't support wildcards, we have to handle them
manually. Take another mask, `my*a`. Only two objects match it:
```
my_file_a
my_another_file_a
```
To get this result, we split the prefix at the first `*` and keep the left
part, because every expected object shares this "fixed" part of the prefix. In
our example it is the "subprefix" `my`. We pass it to `list_objects_v2` and get
the list of candidates:
```
my_file_a
my_another_file_a
my_file_b
```
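The narrowing step can be sketched without touching S3. This is only an
illustration, assuming an in-memory list of keys stands in for the bucket and
`str.startswith` stands in for the `Prefix` argument of `list_objects_v2`;
`fixed_prefix` and `list_candidates` are hypothetical helper names, not part of
the hook:

```python
# In-memory stand-in for the bucket contents (assumption for illustration).
KEYS = [
    "my_file_a",
    "my_another_file_a",
    "my_file_b",
    "folder/",
    "folder/my_another_file_a",
    "folder/my_another_file_b",
]


def fixed_prefix(mask: str) -> str:
    """Return the part of the mask before the first `*` (hypothetical helper)."""
    return mask.split("*", 1)[0]


def list_candidates(keys: list[str], prefix: str) -> list[str]:
    """Emulate a prefix listing: keep the keys that start with `prefix`."""
    return [key for key in keys if key.startswith(prefix)]


candidates = list_candidates(KEYS, fixed_prefix("my*a"))
print(candidates)  # ['my_file_a', 'my_another_file_a', 'my_file_b']
```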
In the next step we iterate over this list and check whether each object's key
matches the original mask `my*a`:
```python
fnmatch.fnmatch(k["Key"], _original_prefix)
# fnmatch.fnmatch("my_file_a", "my*a") == True
# fnmatch.fnmatch("my_another_file_a", "my*a") == True
# fnmatch.fnmatch("my_file_b", "my*a") == False
```
The last candidate is filtered out, leaving only the objects we need.
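As a runnable version of the filtering step, here is a small sketch that
applies `fnmatch.fnmatch` to a hard-coded candidate list (the list and the
variable names are assumptions for illustration, not the hook's code). Note
that `*` in `fnmatch` also matches `/`, which is why a mask such as `*a` can
match keys inside `folder/`:

```python
import fnmatch

# Candidates as they would come back from the prefix listing (hard-coded here).
candidates = ["my_file_a", "my_another_file_a", "my_file_b"]
mask = "my*a"

# Keep only the keys that match the full wildcard mask.
matched = [key for key in candidates if fnmatch.fnmatch(key, mask)]
print(matched)  # ['my_file_a', 'my_another_file_a']

# `*` is not limited to a single path segment.
crosses_slash = fnmatch.fnmatch("folder/my_another_file_a", "*a")
print(crosses_slash)  # True
```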
In your example the prefix is `*a`, so the "fixed" part of the prefix is the
empty string `""`, and in the first step of the algorithm `list_objects_v2`
retrieves all objects. In the second step we apply the original mask and filter
out every object that doesn't match. This is the worst case, but unfortunately
there is no way to handle it more efficiently short of wildcard support inside
the AWS SDK itself.
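To make the cost difference concrete, here is a rough end-to-end sketch over
the example bucket. It assumes an in-memory key list in place of S3, and
`match_keys` is a hypothetical helper that returns how many candidates the
prefix listing yields alongside the keys that survive the mask:

```python
import fnmatch

# In-memory stand-in for the bucket contents (assumption for illustration).
KEYS = [
    "my_file_a",
    "my_another_file_a",
    "my_file_b",
    "folder/",
    "folder/my_another_file_a",
    "folder/my_another_file_b",
]


def match_keys(keys: list[str], mask: str) -> tuple[int, list[str]]:
    """Return (number of candidates listed, keys matching the mask)."""
    prefix = mask.split("*", 1)[0]  # fixed part before the first `*`
    candidates = [key for key in keys if key.startswith(prefix)]
    matched = [key for key in candidates if fnmatch.fnmatch(key, mask)]
    return len(candidates), matched


narrow = match_keys(KEYS, "my*a")
worst = match_keys(KEYS, "*a")
print(narrow)  # (3, ['my_file_a', 'my_another_file_a'])
print(worst)   # (6, ['my_file_a', 'my_another_file_a', 'folder/my_another_file_a'])
```

With `my*a` only three keys are listed before filtering; with `*a` every key
in the bucket is listed first.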