moiseenkov opened a new pull request, #32376:
URL: https://github.com/apache/airflow/pull/32376

   Bugfix for the following case (b/289486604).
   
   The goal is to copy `source/foo.txt` to `dest/foo.txt` within a single GCS 
bucket.
   1. Create a GCS bucket and upload two files to a `source/` directory like this:
   ```
   gs://my-bucket/source/foo.txt
   gs://my-bucket/source/foo.txt.abc
   ```
   2. Upload the following DAG to a Cloud Composer environment:
   ```python
   from airflow import DAG
   from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator
   from datetime import datetime
   
   with DAG(
       dag_id="gcs_to_gcs_fail_example",
       schedule_interval=None,
       catchup=False,
       start_date=datetime(2021,1,1)
   ) as dag:
       copy_file = GCSToGCSOperator(
           task_id="copy_file",
           source_bucket="my-bucket",
           source_object="source/foo.txt",
           destination_object="dest/foo.txt",
           exact_match=True,
       )
       copy_file
   ```
   3. Run the DAG
   
   **Expected bucket state:**
   ```
   gs://my-bucket/source/foo.txt
   gs://my-bucket/source/foo.txt.abc
   gs://my-bucket/dest/foo.txt
   ```
   **Actual (incorrect) bucket state:**
   ```
   gs://my-bucket/source/foo.txt
   gs://my-bucket/source/foo.txt.abc
   gs://my-bucket/dest/foo.txt/source/foo.txt
   ```
   
   ======================================================
   
   The bug was caused by missing handling of `exact_match=True` when objects 
are copied without a wildcard. This PR adds that handling.
   
   ======================================================
   
   However, if the flag is left at its default value `exact_match=False`, the 
operator produces a different result:
   ```
   gs://my-bucket/source/foo.txt
   gs://my-bucket/source/foo.txt.abc
   gs://my-bucket/dest/foo.txt/source/foo.txt
   gs://my-bucket/dest/foo.txt/source/foo.txt.abc
   ```
   This is actually correct, because in general 
`source_object="path/to/the/file.txt"` is treated not as a file path but as an 
object name **prefix** 
([doc](https://cloud.google.com/storage/docs/samples/storage-list-files-with-prefix)).
 That's why the **prefix** `source_object="source/foo.txt"` matches both 
objects:
   ```
   gs://my-bucket/source/foo.txt
   gs://my-bucket/source/foo.txt.abc
   ```
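The prefix matching described above can be illustrated with a minimal, self-contained Python sketch. It does not call GCS or the operator; `list_by_prefix` and `select_exact` are hypothetical helpers that only mimic prefix listing and exact-name filtering over the two object names from this example.

```python
# Illustrative sketch only: mimics prefix-based listing with plain strings.
# `list_by_prefix` and `select_exact` are hypothetical helpers,
# not Airflow or GCS APIs.

BUCKET_OBJECTS = ["source/foo.txt", "source/foo.txt.abc"]


def list_by_prefix(object_names, prefix):
    """Return every object name that starts with the given prefix."""
    return [name for name in object_names if name.startswith(prefix)]


def select_exact(object_names, name):
    """Return only the object whose name equals `name` exactly."""
    return [candidate for candidate in object_names if candidate == name]


prefix_matches = list_by_prefix(BUCKET_OBJECTS, "source/foo.txt")
exact_matches = select_exact(BUCKET_OBJECTS, "source/foo.txt")

print(prefix_matches)  # both objects share the prefix
print(exact_matches)   # only the object named exactly source/foo.txt
```

Treating `source/foo.txt` as a prefix selects both objects, while an exact-name comparison selects only the first one.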
   And if `destination_object` is set, the destination object name is built as 
the destination prefix followed by the source object name (for example, 
`dest/foo.txt/source/foo.txt`). GCS makes no distinction between copying a file 
and copying a folder: both are simply objects.
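As a rough illustration of the naming described above, the sketch below reproduces the bucket listings from this description using plain string operations. `plan_copies` is a hypothetical helper written for this example; it is an assumption about the observable behavior, not the operator's actual code. With prefix matching, each matched source name is appended to the destination prefix; with an exact match (the expected state after this fix), the destination name is used as-is.

```python
# Illustrative sketch only: `plan_copies` is a hypothetical helper that
# models the listings shown in this description, not GCSToGCSOperator itself.


def plan_copies(object_names, source_object, destination_object, exact_match=False):
    """Map each selected source object name to its destination object name."""
    if exact_match:
        # Only the object whose name equals source_object is copied,
        # and it lands exactly at destination_object.
        return {
            name: destination_object
            for name in object_names
            if name == source_object
        }
    # source_object acts as a prefix: every match is copied under the
    # destination prefix, keeping its full source name.
    return {
        name: f"{destination_object}/{name}"
        for name in object_names
        if name.startswith(source_object)
    }


objects = ["source/foo.txt", "source/foo.txt.abc"]
prefix_plan = plan_copies(objects, "source/foo.txt", "dest/foo.txt")
exact_plan = plan_copies(objects, "source/foo.txt", "dest/foo.txt", exact_match=True)

print(prefix_plan)
print(exact_plan)
```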
   
   Perhaps it makes sense to implement more "human-friendly" logic, so that the 
operator treats its inputs as files and folders, but I think that should be a 
separate operator, because `GCSToGCSOperator`'s current implementation has 
become too complicated for major changes. These are just my thoughts; I'm not 
insisting.

