Taragolis commented on PR #26886:
URL: https://github.com/apache/airflow/pull/26886#issuecomment-1270348205
I do not have any objections about this PR. My messages more like discussion
and ideas not related to current PR.
My main objection about method `S3Hook.download_file` if it would be
initially something like
```python
@provide_bucket_name
@unify_bucket_name_and_key
def download_file(
self,
key: str,
bucket_name: str | None = None,
filename: str | None = None,
extra_args=None,
config=None
) -> None:
self.conn.download_file(
bucket_name, key, filename, ExtraArgs=extra_args, Callback=None,
Config=config
)
```
Then we do not even have this discussion and modification not required
because it just would be simple thin wrapper around one boto3 s3 client method
which resolved different combination of key and bucket name.
But we have what we have and we cannot revert time back. So it might a lot
of user code exists which use this method, how users use current method and
what they additionally wanted. I could just predict:
- User want to save to location which mount to persistent storage (NFS
Share) and also want to keep `prefix` not only `filename`
- User want to check is this file already exists in File System and raise an
error
- User want to add Object Version to the filename
- User want to delete file in the end of the `execute()` method of his
operator / callable function
My current idea - just inform (at least in docstring) that might exists
better way which could help to solve their cases rather than try to use this
method.
Also for better user experience we could create some additional static
method in hook which could be use for split s3 URL to the parts, it might help
them to do some specific stuff without write some additional code.
**Sample of helper**
```python
from typing import NamedTuple
from urllib.parse import urlparse
class S3UrlParts(NamedTuple):
bucket: str
key: str | None
prefix: str | None
parts: list[str]
filename: str | None
basename: str | None
suffix: str | None
suffixes: list[str]
def parse_url(url: str):
if "://" not in url:
raise ValueError(f"Invalid S3 URL: {url!r}")
url_parts = urlparse(url)
if url_parts.scheme != "s3":
# Yeah, I know that about 4 ways to define S3 URL and not limited by
s3://bucket/prefix/key.json
raise ValueError(f"Invalid S3 URL (incorrect schema): {url!r}")
elif not url_parts.netloc:
raise ValueError(f"Invalid S3 URL (no bucket name): {url!r}")
bucket = url_parts.netloc
key = url_parts.path.lstrip("/")
key_parts = key.rsplit("/", 1)
if len(key_parts) == 2:
prefix, filename = key_parts
else:
prefix, filename = None, key_parts[0]
parts = []
basename = None
suffix = None
suffixes = []
if prefix:
parts.extend(prefix.split("/"))
if filename:
parts.append(filename)
file_parts = filename.split(".", 1)
if len(file_parts) == 2:
basename, suffix = file_parts
suffixes = suffix.split(".")
else:
basename, suffix = file_parts[0], None
return S3UrlParts(
bucket,
key=key,
prefix=prefix or None,
parts=parts,
filename=filename or None,
basename=basename,
suffix=suffix,
suffixes=suffixes
)
```
**Sample of "user" code**
```python
s3_urls = [
"s3://awesome-bucket/this/is/path/to/awesome-data.parquet.snappy",
"s3://awesome-bucket",
"s3://awesome-bucket/",
"s3://awesome-bucket/prefix/",
"s3://awesome-bucket/key.json",
"s3://awesome-bucket/key-with-no-extension",
"s3://awesome-bucket/key-with-dot-in-the-end.",
]
for s3_url in s3_urls:
print(f"S3 URL: {s3_url}")
print(f"Parsed URL: {parse_url(s3_url)}")
print()
```
**Sample Output**
```
S3 URL: s3://awesome-bucket/this/is/path/to/awesome-data.parquet.snappy
S3UrlParts(bucket='awesome-bucket',
key='this/is/path/to/awesome-data.parquet.snappy', prefix='this/is/path/to',
parts=['this', 'is', 'path', 'to', 'awesome-data.parquet.snappy'],
filename='awesome-data.parquet.snappy', basename='awesome-data',
suffix='parquet.snappy', suffixes=['parquet', 'snappy'])
S3 URL: s3://awesome-bucket
S3UrlParts(bucket='awesome-bucket', key='', prefix=None, parts=[],
filename=None, basename=None, suffix=None, suffixes=[])
S3 URL: s3://awesome-bucket/
S3UrlParts(bucket='awesome-bucket', key='', prefix=None, parts=[],
filename=None, basename=None, suffix=None, suffixes=[])
S3 URL: s3://awesome-bucket/prefix/
S3UrlParts(bucket='awesome-bucket', key='prefix/', prefix='prefix',
parts=['prefix'], filename=None, basename=None, suffix=None, suffixes=[])
S3 URL: s3://awesome-bucket/key.json
S3UrlParts(bucket='awesome-bucket', key='key.json', prefix=None,
parts=['key.json'], filename='key.json', basename='key', suffix='json',
suffixes=['json'])
S3 URL: s3://awesome-bucket/key-with-no-extension
S3UrlParts(bucket='awesome-bucket', key='key-with-no-extension',
prefix=None, parts=['key-with-no-extension'], filename='key-with-no-extension',
basename='key-with-no-extension', suffix=None, suffixes=[])
S3 URL: s3://awesome-bucket/key-with-dot-in-the-end.
S3UrlParts(bucket='awesome-bucket', key='key-with-dot-in-the-end.',
prefix=None, parts=['key-with-dot-in-the-end.'],
filename='key-with-dot-in-the-end.', basename='key-with-dot-in-the-end',
suffix='', suffixes=[''])
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]