moiseenkov opened a new pull request, #32376:
URL: https://github.com/apache/airflow/pull/32376
Bugfix for the following case (b/289486604).
The goal is to copy `source/foo.txt` to `dest/foo.txt` within a single GCS
bucket.
1. Create a GCS bucket and upload two files to a `source/` directory, like this:
```
gs://my-bucket/source/foo.txt
gs://my-bucket/source/foo.txt.abc
```
2. Upload the following DAG to a Cloud Composer environment:
```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator

with DAG(
    dag_id="gcs_to_gcs_fail_example",
    schedule_interval=None,
    catchup=False,
    start_date=datetime(2021, 1, 1),
) as dag:
    # destination_bucket is omitted, so the copy stays within my-bucket.
    copy_file = GCSToGCSOperator(
        task_id="copy_file",
        source_bucket="my-bucket",
        source_object="source/foo.txt",
        destination_object="dest/foo.txt",
        exact_match=True,
    )

    copy_file
```
3. Run the DAG
**Expected bucket state:**
```
gs://my-bucket/source/foo.txt
gs://my-bucket/source/foo.txt.abc
gs://my-bucket/dest/foo.txt
```
**Actual (incorrect) bucket state:**
```
gs://my-bucket/source/foo.txt
gs://my-bucket/source/foo.txt.abc
gs://my-bucket/dest/foo.txt/source/foo.txt
```
======================================================
The bug was caused by `exact_match=True` not being handled when objects are
copied without a wildcard. This PR fixes that.
======================================================
However, if the flag is set to its default value `exact_match=False`, then
the operator's result is different:
```
gs://my-bucket/source/foo.txt
gs://my-bucket/source/foo.txt.abc
gs://my-bucket/dest/foo.txt/source/foo.txt
gs://my-bucket/dest/foo.txt/source/foo.txt.abc
```
This is actually correct, because in general
`source_object="path/to/the/file.txt"` is treated not as a file path but as an
object name **prefix**
([doc](https://cloud.google.com/storage/docs/samples/storage-list-files-with-prefix)).
That's why the **prefix** `source_object="source/foo.txt"` matches
both objects:
```
gs://my-bucket/source/foo.txt
gs://my-bucket/source/foo.txt.abc
```
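To make the prefix behaviour concrete, here is a minimal sketch in plain Python. The helper `match_sources` is hypothetical and is not the operator's code; it only assumes that listing by prefix behaves like `str.startswith` and that `exact_match=True` narrows the result to the object whose full name equals `source_object`.

```python
def match_sources(object_names, source_object, exact_match=False):
    """Hypothetical helper: pick the objects a copy would act on."""
    # Prefix semantics: every name that starts with source_object matches.
    matched = [name for name in object_names if name.startswith(source_object)]
    if exact_match:
        # Exact semantics: keep only the object named exactly source_object.
        matched = [name for name in matched if name == source_object]
    return matched


bucket = ["source/foo.txt", "source/foo.txt.abc"]
print(match_sources(bucket, "source/foo.txt"))
print(match_sources(bucket, "source/foo.txt", exact_match=True))
```

With the default flag both objects are selected; with `exact_match=True` only `source/foo.txt` remains.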
If `destination_object` is set, each destination object name is built by
concatenating the destination prefix and the source object name.
GCS makes no distinction between copying a file and copying a folder: both
are simply objects.
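The resulting destination names can be pictured with a short sketch. The helper `destination_names` is hypothetical; it only reproduces the bucket listing shown for `exact_match=False` and assumes a `/` separator between the destination prefix and the source object name. It is not taken from the operator's implementation.

```python
def destination_names(source_names, destination_object):
    """Hypothetical helper mirroring the observed listing, not the operator's code."""
    # Assumption: the destination prefix and the source name are joined with "/".
    prefix = destination_object.rstrip("/")
    return [f"{prefix}/{name}" for name in source_names]


sources = ["source/foo.txt", "source/foo.txt.abc"]
for name in destination_names(sources, "dest/foo.txt"):
    print(f"gs://my-bucket/{name}")
```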
Perhaps it makes sense to implement more "human-friendly" logic, so that the
operator treats its inputs as files and folders, but I think that should
be a separate operator, because the current implementation of `GCSToGCSOperator`
has become too complicated for major changes. These are just my thoughts; I'm not insisting.