alexkruc commented on code in PR #26886:
URL: https://github.com/apache/airflow/pull/26886#discussion_r989868318


##########
airflow/providers/amazon/aws/hooks/s3.py:
##########
@@ -902,14 +911,21 @@ def download_file(self, key: str, bucket_name: str | None = None, local_path: st
             else:
                 raise e
 
-        with NamedTemporaryFile(dir=local_path, prefix='airflow_tmp_', delete=False) as local_tmp_file:
+        if preserve_file_name:
+            local_dir = local_path if local_path else gettempdir()

Review Comment:
   @XD-DENG has described a real issue with my implementation. If 2 different tasks try to download a file with the same name (let's say `s3://....../data.json`), the end file will be overwritten because of the way that file mode `w` works.
   We have a couple of ways to guard against this that I can think of, and before I implement any of them, I would like to ask the people in this thread for their opinion:
   
   1. Add a check that verifies whether the end file already exists and, if it does, raise an exception and fail the task. This will prevent accidental overwrites and delegate the problem to the user.
   2. As @uranusjr suggested, add a new "layer", meaning a subdirectory that will sit under the tmpdir (I guess we need the subdirectory only when no explicit `local_path` is provided, because only then can unexpected clashes happen). But this approach raises a few more questions: <br>
   2.1. Should the sub-dir be randomly generated (as @uranusjr mentioned, so that the path is unpredictable), something like `/tmp/airflow_file_dir_ha235cd1/data.json`? Or should we let the user control the path, so they still get a predictable end path while avoiding the clashes, something like `/tmp/my_json_data/data.json`?
   2.2. Should the sub-dir only be created when the user wants to keep the original file name? The previous flow (the one that generates a filename) should not have one, because adding it would be a breaking change to that flow, right?
   This can lead to slightly strange behavior: when using the original file name we create a temp dir, but in the other flow we just put the file directly. It feels a bit confusing to me. WDYT?
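   To make the two guards concrete, here is a rough sketch of what each could look like (the function names and the `airflow_file_dir_` prefix are made up for illustration; this is not the hook's actual code):

   ```python
   from __future__ import annotations

   import os
   from tempfile import gettempdir, mkdtemp


   def target_path_option_1(local_dir: str, filename: str) -> str:
       """Option 1: fail fast if the destination file already exists."""
       path = os.path.join(local_dir, filename)
       if os.path.exists(path):
           raise FileExistsError(
               f"{path} already exists; pass a distinct local_path to avoid the clash"
           )
       return path


   def target_path_option_2(filename: str, local_dir: str | None = None) -> str:
       """Option 2: drop the file into a randomly named subdirectory so that
       concurrent downloads of the same key cannot collide."""
       if local_dir is None:
           # mkdtemp creates an unpredictable directory such as
           # /tmp/airflow_file_dir_ha235cd1, mode 0700, owned by the caller
           local_dir = mkdtemp(prefix="airflow_file_dir_", dir=gettempdir())
       return os.path.join(local_dir, filename)
   ```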
   
   I took the liberty of checking what the current GCS and Azure hooks provide for saving a file. I see that the end file name is often an open parameter, so users can provide whatever filename they want. IMO that's a more significant vulnerability risk, but this is the way they currently work:
   
https://github.com/apache/airflow/blob/2f326a6c03efed8788fe0263df96b68abb801088/airflow/providers/google/cloud/hooks/gcs.py#L320
   
https://github.com/apache/airflow/blob/2f326a6c03efed8788fe0263df96b68abb801088/airflow/providers/microsoft/azure/hooks/fileshare.py#L226
   
https://github.com/apache/airflow/blob/2f326a6c03efed8788fe0263df96b68abb801088/airflow/providers/microsoft/azure/hooks/data_lake.py#L194
   
   I guess it's also an option for us (let's call it option 3) to have the user control the file name & path, with the default being the actual file name in S3. But this feels less secure than defaulting to the S3 name, because in S3 you can't have filenames like `../../../my_exploit.sh` that can behave strangely at runtime 😅
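   To illustrate the path-traversal concern with option 3: if we did let users pass an arbitrary filename, we would want a validation step along these lines (a hypothetical helper, not anything in the hook today) that rejects names resolving outside the download directory:

   ```python
   import os


   def safe_join(local_dir: str, filename: str) -> str:
       """Reject a user-supplied filename such as '../../../my_exploit.sh'
       that would resolve to a path outside the intended download directory."""
       root = os.path.realpath(local_dir)
       candidate = os.path.realpath(os.path.join(local_dir, filename))
       # commonpath returns the shared prefix; any name that escaped
       # `root` via '..' components fails this check
       if os.path.commonpath([root, candidate]) != root:
           raise ValueError(f"filename {filename!r} escapes {local_dir!r}")
       return candidate
   ```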
   
   IMO, I'm in favour of option 1: if several tasks try to download the same filename, block the overwrite by failing the job. It can point to some "data design" problem, and users can solve it by specifying their own `local_path` param for each job that uses this function, which is also the better modeling when using this hook method IMO.
   
   @uranusjr @XD-DENG @o-nikolas - I will truly appreciate your input on this 
:) 


