[GitHub] [airflow] Taragolis commented on pull request #26886: Adding `preserve_file_name` param to `S3Hook.download_file` method

GitBox Thu, 06 Oct 2022 09:15:00 -0700


Taragolis commented on PR #26886:
URL: https://github.com/apache/airflow/pull/26886#issuecomment-1270348205


   I do not have any objections about this PR. My messages more like discussion 
and ideas not related to current PR.
   
   My main objection about method `S3Hook.download_file` if it would be 
initially something like
   
   ```python
       @provide_bucket_name
       @unify_bucket_name_and_key
       def download_file(
           self,
           key: str,
           bucket_name: str | None = None, 
           filename: str | None = None, 
           extra_args=None, 
           config=None
       ) -> None:
           self.conn.download_file(
               bucket_name, key, filename, ExtraArgs=extra_args, Callback=None, 
Config=config
           )
   ```
   
   Then we do not even have this discussion and modification not required 
because it just would be simple thin wrapper around one boto3 s3 client method 
which resolved different combination of key and bucket name.
   
   But we have what we have and we cannot revert time back. So it might a lot 
of user code exists which use this method, how users use current method and 
what they additionally wanted. I could just predict:
   - User want to save to location which mount to persistent storage (NFS 
Share) and also want to keep `prefix` not only `filename`
   - User want to check is this file already exists in File System and raise an 
error
   - User want to add Object Version to the filename
   - User want to delete file in the end of the `execute()` method of his 
operator / callable function
   
   My current idea - just inform (at least in docstring) that might exists 
better way which could help to solve their cases rather than try to use this 
method.
   
   Also for better user experience we could create some additional static 
method in hook which could be use for split s3 URL to the parts, it might help 
them to do some specific stuff without write some additional code.
   
   **Sample of helper**
   ```python
   from typing import NamedTuple
   from urllib.parse import urlparse
   
   
   class S3UrlParts(NamedTuple):
       bucket: str
       key: str | None
       prefix: str | None
       parts: list[str]
       filename: str | None
       basename: str | None
       suffix: str | None
       suffixes: list[str]
   
   
   def parse_url(url: str):
       if "://" not in url:
           raise ValueError(f"Invalid S3 URL: {url!r}")
       url_parts = urlparse(url)
       if url_parts.scheme != "s3":
           # Yeah, I know that about 4 ways to define S3 URL and not limited by 
s3://bucket/prefix/key.json
           raise ValueError(f"Invalid S3 URL (incorrect schema): {url!r}")
       elif not url_parts.netloc:
           raise ValueError(f"Invalid S3 URL (no bucket name): {url!r}")
   
       bucket = url_parts.netloc
       key = url_parts.path.lstrip("/")
       key_parts = key.rsplit("/", 1)
       if len(key_parts) == 2:
           prefix, filename = key_parts
       else:
           prefix, filename = None, key_parts[0]
   
       parts = []
       basename = None
       suffix = None
       suffixes = []
       if prefix:
           parts.extend(prefix.split("/"))
       if filename:
           parts.append(filename)
   
           file_parts = filename.split(".", 1)
           if len(file_parts) == 2:
               basename, suffix = file_parts
               suffixes = suffix.split(".")
           else:
               basename, suffix = file_parts[0], None
   
       return S3UrlParts(
           bucket,
           key=key,
           prefix=prefix or None,
           parts=parts,
           filename=filename or None,
           basename=basename,
           suffix=suffix,
           suffixes=suffixes
       )
   ```
   
   **Sample of "user" code**
   ```python
   s3_urls = [
       "s3://awesome-bucket/this/is/path/to/awesome-data.parquet.snappy",
       "s3://awesome-bucket",
       "s3://awesome-bucket/",
       "s3://awesome-bucket/prefix/",
       "s3://awesome-bucket/key.json",
       "s3://awesome-bucket/key-with-no-extension",
       "s3://awesome-bucket/key-with-dot-in-the-end.",
   ]
   
   for s3_url in s3_urls:
       print(f"S3 URL: {s3_url}")
       print(f"Parsed URL: {parse_url(s3_url)}")
       print()
   ```
   
   **Sample Output**
   ```
   S3 URL: s3://awesome-bucket/this/is/path/to/awesome-data.parquet.snappy
   S3UrlParts(bucket='awesome-bucket', 
key='this/is/path/to/awesome-data.parquet.snappy', prefix='this/is/path/to', 
parts=['this', 'is', 'path', 'to', 'awesome-data.parquet.snappy'], 
filename='awesome-data.parquet.snappy', basename='awesome-data', 
suffix='parquet.snappy', suffixes=['parquet', 'snappy'])
   
   S3 URL: s3://awesome-bucket
   S3UrlParts(bucket='awesome-bucket', key='', prefix=None, parts=[], 
filename=None, basename=None, suffix=None, suffixes=[])
   
   S3 URL: s3://awesome-bucket/
   S3UrlParts(bucket='awesome-bucket', key='', prefix=None, parts=[], 
filename=None, basename=None, suffix=None, suffixes=[])
   
   S3 URL: s3://awesome-bucket/prefix/
   S3UrlParts(bucket='awesome-bucket', key='prefix/', prefix='prefix', 
parts=['prefix'], filename=None, basename=None, suffix=None, suffixes=[])
   
   S3 URL: s3://awesome-bucket/key.json
   S3UrlParts(bucket='awesome-bucket', key='key.json', prefix=None, 
parts=['key.json'], filename='key.json', basename='key', suffix='json', 
suffixes=['json'])
   
   S3 URL: s3://awesome-bucket/key-with-no-extension
   S3UrlParts(bucket='awesome-bucket', key='key-with-no-extension', 
prefix=None, parts=['key-with-no-extension'], filename='key-with-no-extension', 
basename='key-with-no-extension', suffix=None, suffixes=[])
   
   S3 URL: s3://awesome-bucket/key-with-dot-in-the-end.
   S3UrlParts(bucket='awesome-bucket', key='key-with-dot-in-the-end.', 
prefix=None, parts=['key-with-dot-in-the-end.'], 
filename='key-with-dot-in-the-end.', basename='key-with-dot-in-the-end', 
suffix='', suffixes=[''])
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [airflow] Taragolis commented on pull request #26886: Adding `preserve_file_name` param to `S3Hook.download_file` method

Reply via email to