creste opened a new issue, #24210:
URL: https://github.com/apache/beam/issues/24210

   ### What would you like to happen?
   
   # Problem
   
   Currently, the Azure Filesystem for the Python SDK only supports 
authenticating using the 
[`AZURE_STORAGE_CONNECTION_STRING`](https://github.com/apache/beam/blob/b952b41788acc20edbe5b75b2196f30dbf8fdeb0/sdks/python/apache_beam/io/azure/blobstorageio.py#L109)
 environment variable.  That approach has two limitations:
   - The `AZURE_STORAGE_CONNECTION_STRING` environment variable must be defined 
on all systems where the pipeline executes.  This is difficult to configure 
when using Beam worker-pool sidecar containers with the FlinkRunner because 
Flink may be running in session mode with different Beam pipelines needing 
different connection strings.
   - The call to 
[`BlobServiceClient.from_connection_string()`](https://github.com/apache/beam/blob/b952b41788acc20edbe5b75b2196f30dbf8fdeb0/sdks/python/apache_beam/io/azure/blobstorageio.py#L111)
 does not support all of the authentication methods supported by 
[DefaultAzureCredential](https://learn.microsoft.com/en-us/python/api/overview/azure/identity-readme?view=azure-python#defaultazurecredential).
  For my use case in particular, it does not support [Managed 
Identity](https://learn.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/overview)
 credentials.
   
   # Solution
   
   I plan to address the above limitations in a PR by adding new Azure-specific 
pipeline options described below.
   
   ## `--azure_blob_storage_connection_string`
   Specifies the [Azure Storage Connection 
String](https://learn.microsoft.com/en-us/azure/storage/common/storage-configure-connection-string).
   
   Can be used instead of the `AZURE_STORAGE_CONNECTION_STRING` environment 
variable or the new `--azure_blob_storage_account_url` pipeline option 
described below.
   
   Example:
   ```bash
   python -m apache_beam.examples.wordcount \
     --input azfs://devstoreaccount1/container/* \
     --output azfs://devstoreaccount1/container/py-wordcount-integration \
     --azure_blob_storage_connection_string "DefaultEndpointsProtocol=https;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=https://azurite:10000/devstoreaccount1;"
   ```
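
A connection string is a semicolon-separated list of `Key=Value` pairs. As a rough illustration of the format only (this is not part of the planned PR, and `parse_connection_string` is a hypothetical helper name), the value from the example above could be split into its parts like this. The sketch splits each segment on the first `=` only, because the base64 `AccountKey` itself ends in `=` padding.

```python
from __future__ import annotations


def parse_connection_string(conn_str: str) -> dict[str, str]:
    """Split an Azure Storage connection string into its key/value parts.

    Hypothetical helper for illustration; the Azure SDK does its own parsing.
    """
    parts: dict[str, str] = {}
    for segment in conn_str.split(";"):
        segment = segment.strip()
        if not segment:
            continue  # tolerate the trailing ';'
        # Split on the first '=' only: the account key ends in '=' padding.
        key, sep, value = segment.partition("=")
        if not sep:
            raise ValueError(f"malformed segment: {segment!r}")
        parts[key] = value
    return parts


# The Azurite development connection string used in the example above.
conn_str = (
    "DefaultEndpointsProtocol=https;"
    "AccountName=devstoreaccount1;"
    "AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq"
    "/K1SZFPTOtr/KBHBeksoGMGw==;"
    "BlobEndpoint=https://azurite:10000/devstoreaccount1;"
)
parts = parse_connection_string(conn_str)
```

The account name and the blob endpoint both travel inside the string, which is why no separate account URL is needed with this option.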
   ## `--azure_blob_storage_account_url`
   Specifies the [Azure Blob Storage Account Endpoint 
URL](https://learn.microsoft.com/en-us/azure/storage/common/storage-account-overview#standard-endpoints).
   
   Can be used instead of the `AZURE_STORAGE_CONNECTION_STRING` environment 
variable or the new `--azure_blob_storage_connection_string` pipeline option 
described above.
   
   This pipeline option uses 
[`DefaultAzureCredential()`](https://learn.microsoft.com/en-us/python/api/overview/azure/identity-readme?view=azure-python#authenticate-with-defaultazurecredential)
 to authenticate.
   
   Example:
   ```bash
   python -m apache_beam.examples.wordcount \
     --input azfs://devstoreaccount1/container/* \
     --output azfs://devstoreaccount1/container/py-wordcount-integration \
     --azure_blob_storage_account_url https://devstoreaccount1.blob.core.windows.net/
   ```
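
To make the authentication path concrete, here is a minimal sketch of how a client could be built from an account URL with `DefaultAzureCredential`. It assumes the `azure-identity` and `azure-storage-blob` packages are installed; `credential_kwargs` and `make_blob_service_client` are hypothetical names, not the code that will land in `blobstorageio.py`. The optional client ID corresponds to the `--azure_managed_identity_client_id` option described in the next section.

```python
from __future__ import annotations

from typing import Any, Optional


def credential_kwargs(
    managed_identity_client_id: Optional[str] = None,
) -> dict[str, Any]:
    """Keyword arguments to pass to DefaultAzureCredential (hypothetical helper)."""
    if managed_identity_client_id:
        return {"managed_identity_client_id": managed_identity_client_id}
    return {}


def make_blob_service_client(
    account_url: str, managed_identity_client_id: Optional[str] = None
):
    """Build a BlobServiceClient that authenticates with DefaultAzureCredential."""
    # Imported lazily so this module still loads without the Azure SDK installed.
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import BlobServiceClient

    credential = DefaultAzureCredential(
        **credential_kwargs(managed_identity_client_id)
    )
    return BlobServiceClient(account_url=account_url, credential=credential)
```

With no client ID, `DefaultAzureCredential` tries its usual chain of credential sources, so the same pipeline option should work on a developer machine and on a worker with a managed identity.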
   
   ## `--azure_managed_identity_client_id`
   Specifies the Managed Identity Client ID.  Can only be used with 
`--azure_blob_storage_account_url`.
   
   This pipeline option uses 
[`DefaultAzureCredential(managed_identity_client_id=client_id)`](https://learn.microsoft.com/en-us/python/api/overview/azure/identity-readme?view=azure-python#specify-a-user-assigned-managed-identity-for-defaultazurecredential)
 to authenticate.
   
   Example:
   ```bash
   python -m apache_beam.examples.wordcount \
     --input azfs://devstoreaccount1/container/* \
     --output azfs://devstoreaccount1/container/py-wordcount-integration \
     --azure_blob_storage_account_url https://devstoreaccount1.blob.core.windows.net/ \
     --azure_managed_identity_client_id ca6cc1a3-4b82-48bd-97ca-8e799c0abff6
   ```
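
The three options and the existing environment variable interact, so here is a hedged sketch of how they might be resolved into a single authentication choice. It uses plain `argparse` rather than Beam's `PipelineOptions`, and the names `AzureAuth`, `add_azure_options`, and `resolve_auth` are hypothetical. The precedence shown (connection string option, then account URL option, then environment variable) is an assumption for illustration; the issue only states that the client ID requires the account URL.

```python
from __future__ import annotations

import argparse
import os
from typing import Mapping, NamedTuple, Optional, Sequence

ENV_VAR = "AZURE_STORAGE_CONNECTION_STRING"


class AzureAuth(NamedTuple):
    mode: str  # "connection_string" or "account_url"
    value: str
    managed_identity_client_id: Optional[str] = None


def add_azure_options(parser: argparse.ArgumentParser) -> None:
    """Declare the three proposed options on a plain argparse parser."""
    parser.add_argument("--azure_blob_storage_connection_string", default=None)
    parser.add_argument("--azure_blob_storage_account_url", default=None)
    parser.add_argument("--azure_managed_identity_client_id", default=None)


def resolve_auth(
    argv: Sequence[str], environ: Mapping[str, str] = os.environ
) -> AzureAuth:
    parser = argparse.ArgumentParser(allow_abbrev=False)
    add_azure_options(parser)
    # Ignore unrelated pipeline flags such as --input and --output.
    opts, _unknown = parser.parse_known_args(list(argv))

    # Stated in the issue: the client ID only makes sense with an account URL.
    if opts.azure_managed_identity_client_id and not opts.azure_blob_storage_account_url:
        raise ValueError(
            "--azure_managed_identity_client_id requires "
            "--azure_blob_storage_account_url"
        )

    # Assumed precedence: explicit options win over the environment variable.
    if opts.azure_blob_storage_connection_string:
        return AzureAuth("connection_string", opts.azure_blob_storage_connection_string)
    if opts.azure_blob_storage_account_url:
        return AzureAuth(
            "account_url",
            opts.azure_blob_storage_account_url,
            opts.azure_managed_identity_client_id,
        )
    if environ.get(ENV_VAR):
        return AzureAuth("connection_string", environ[ENV_VAR])
    raise ValueError("no Azure Blob Storage credentials configured")


# Flags from the managed identity example above.
auth = resolve_auth(
    [
        "--azure_blob_storage_account_url",
        "https://devstoreaccount1.blob.core.windows.net/",
        "--azure_managed_identity_client_id",
        "ca6cc1a3-4b82-48bd-97ca-8e799c0abff6",
    ],
    environ={},
)
```

Because the options travel with each pipeline, two pipelines submitted to the same Flink session could resolve to different credentials, which the environment variable alone cannot express.
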
   # Testing
   Per https://github.com/apache/beam/issues/20511, the Azure Filesystem does 
not have integration tests against Azure or 
[Azurite](https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azurite?tabs=visual-studio).
  I plan to add integration tests for the new pipeline options to run against 
Azurite, similar to how [HDFS does its integration 
tests](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/io/hdfs_integration_test).
   
   
   
   ### Issue Priority
   
   Priority: 2
   
   ### Issue Component
   
   Component: io-py-ideas

