sugibuchi opened a new issue, #43197: URL: https://github.com/apache/arrow/issues/43197
### Describe the enhancement requested

## Outline

The Azure Blob File System (ABFS) support in Apache Arrow, implemented in the C++ API by #18014 and integrated into the Python API by #39968, currently allows embedding a storage account key as a password in an ABFS URL.

https://github.com/apache/arrow/blob/r-16.1.0/cpp/src/arrow/filesystem/azurefs.h#L138-L144

However, I strongly recommend stopping this practice for two reasons.

## Security

A storage account access key is practically a "root password": it gives full access to the data in the storage account. Microsoft repeatedly emphasises this point in its documentation and encourages keeping shared keys in a secure place such as Azure Key Vault.

> ## Protect your access keys
> Storage account access keys provide full access to the storage account data and the ability to generate SAS tokens. Always be careful to protect your access keys. Use Azure Key Vault to manage and rotate your keys securely.
> https://learn.microsoft.com/en-us/azure/storage/common/storage-configure-connection-string#protect-your-access-keys

An ABFS URL is usually not treated as confidential information, so embedding a storage account key in it may lead to unexpected exposure of the key.

## Interoperability with other file system implementations

For historical reasons, ABFS URL schemes are inconsistent between different file system implementations.

Original implementations by Apache Hadoop's `hadoop-azure` package [link](https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-abfs-driver#uri-scheme-to-reference-data)

* abfs[s]://\<container\>@\<account\>.dfs.core.windows.net/path/to/file

These URL schemes are widely used, particularly by Apache Spark.
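As a minimal sketch of how the Hadoop-style form decomposes, the following uses only Python's standard `urllib.parse`. The helper name `parse_hadoop_abfs_url` and the decision to reject URLs that carry a password are assumptions made for illustration; they are not part of Arrow, `hadoop-azure`, or any other library mentioned here.

```python
from urllib.parse import urlsplit

# Host suffix used by the Hadoop-style ABFS URL form.
DFS_SUFFIX = ".dfs.core.windows.net"


def parse_hadoop_abfs_url(url):
    """Split abfs[s]://<container>@<account>.dfs.core.windows.net/path
    into (account, container, path). Hypothetical helper for illustration."""
    parts = urlsplit(url)
    if parts.scheme not in ("abfs", "abfss"):
        raise ValueError(f"not an ABFS URL: {url!r}")
    # Illustrative policy: never accept a credential embedded in the URL.
    if parts.password is not None:
        raise ValueError("refusing ABFS URL with an embedded credential")
    host = parts.hostname or ""
    if not parts.username or not host.endswith(DFS_SUFFIX):
        raise ValueError("expected <container>@<account>.dfs.core.windows.net")
    account = host[: -len(DFS_SUFFIX)]
    # In this form the container sits in the URL's userinfo position.
    return account, parts.username, parts.path


if __name__ == "__main__":
    print(parse_hadoop_abfs_url(
        "abfss://data@myaccount.dfs.core.windows.net/path/to/file"
    ))
```

In the Hadoop-style form the container occupies the userinfo position of the URL, which is the same slot a `:<password>` suffix would extend.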
Python `adlfs` for `fsspec` [link](https://github.com/fsspec/adlfs?tab=readme-ov-file#quickstart)

* Hadoop-compatible URL schemes, and
* az://\<container\>/path/to/file
* abfs[s]://\<container\>/path/to/file

Rust `object_store::azure` [link](https://docs.rs/object_store/latest/src/object_store/azure/builder.rs.html#473-487)

* Hadoop-compatible URL schemes
* adlfs-compatible URL schemes, and
* azure://\<container\>/path/to/file
* https://\<account\>.blob.core.windows.net/\<container\>/path/to/file
* https://\<account\>.dfs.core.windows.net/\<container\>/path/to/file

DuckDB `azure` extension [link](https://duckdb.org/docs/extensions/azure#for-azure-data-lake-storage-adls)

* abfss://\<container\>/path/to/file
* abfss://\<account\>.dfs.core.windows.net/\<container\>/path/to/file

Apache Arrow [link](https://github.com/apache/arrow/blob/r-16.1.0/cpp/src/arrow/filesystem/azurefs.h#L138-L144)

* Hadoop-compatible URL schemes, and
* abfs[s]://\<container\>:\<password\>@\<account\>.dfs.core.windows.net/path/to/file
* abfs[s]://\<account\>.dfs.core.windows.net/\<container\>/path/to/file
* abfs[s]://\<password\>@\<account\>.dfs.core.windows.net/\<container\>/path/to/file

This inconsistency between URL schemes already causes problems in applications that use several frameworks, including the extra overhead of translating ABFS URLs from one scheme to another. It may also lead to unexpected behaviour when different file system implementations interpret the same URL differently.

I believe a new file system implementation should respect the existing URL schemes and SHOULD NOT invent new ones. As far as I understand, no other ABFS file system implementation allows embedding storage account keys in ABFS URLs.

### Component(s)

C++, Python
