cccs-eric commented on code in PR #6970:
URL: https://github.com/apache/iceberg/pull/6970#discussion_r1123014890
##########
python/tests/conftest.py:
##########
@@ -1277,8 +1277,8 @@ def adlfs_fsspec_fileio(request: pytest.FixtureRequest)
-> Generator[FsspecFileI
azurite_account_key = request.config.getoption("--adlfs.account-key")
azurite_connection_string =
f"DefaultEndpointsProtocol=http;AccountName={azurite_account_name};AccountKey={azurite_account_key};BlobEndpoint={azurite_url}/{azurite_account_name};"
properties = {
- "connection_string": azurite_connection_string,
- "account_name": azurite_account_name,
+ "adlfs.connection-string": azurite_connection_string,
+ "adlfs.account-name": azurite_account_name,
Review Comment:
When I wrote that code, I only supported what I could test, which is Azure
Gen2 Datalake (or Storage Account). Gen2 is nice since it provides file system
semantics, file-level security, and scale combined with blob storage benefits.
That's why `pyiceberg.io.fsspec` only uses a `AzureBlobFileSystem` instance
(which supports `abfs` and `az` protocols). Could that be extended to support
Gen1? Probably since fsspec supports it, but I wasn't in a position to test it.
As for the required configuration, Azure supports multiple ways to
authenticate and that's why I left it open, to support all possibilites:
```python
def _adlfs(properties: Properties) -> AbstractFileSystem:
from adlfs import AzureBlobFileSystem
fs = AzureBlobFileSystem(**properties)
return fs
```
Authentication is handled by `AzureBlobFileSystem`, who delegates to [Azure
Identity](https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/identity/azure-identity).
But I understand that this does not provide good documentation for a user,
so I'm all for listing configuration options. The question is do we want to
limit those options to only a few (so we code checks as @alaturqua suggested),
or we provide examples in the documentation and we delegate validation to
`AzureBlobFileSystem`?
It can get tricky if we want to code checks for every possibility. As an
example, here is the documentation for `account_key` and `sas_token`:
```
account_key: str
The storage account key. This is used for shared key authentication.
If any of account key, sas token or client_id is specified,
anonymous access
will be used.
sas_token: str
A shared access signature token to use to authenticate requests
instead of the account key. If account key and sas token are both
specified, account key will be used to sign. If any of account key,
sas token
or client_id are specified, anonymous access will be used.
```
I'm not saying it cannot be done, but it requires attention.
I don't know what is Iceberg philosophy when it comes to using an external
library, i.e. should Iceberg retain full control or delegation is fine?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]