[
https://issues.apache.org/jira/browse/HADOOP-19379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18043757#comment-18043757
]
ASF GitHub Bot commented on HADOOP-19379:
-----------------------------------------
MedAnd commented on PR #7252:
URL: https://github.com/apache/hadoop/pull/7252#issuecomment-3630402071
Hi @anujmodi / @KeeProMise / @haiyang1987 / @Hexiaoqiao / @aajisaka /
@ZanderXu,
I’m exploring options for local development in which my PySpark Jupyter
notebooks read files from Azurite (both running in local containers), and
which I hope can later run in Azure Synapse Spark in production with minimal
changes.
My goals are:
- Minimal or zero code changes between local and production environments.
- Local development using PySpark Jupyter running as a Docker container.
- Local development using Azurite (Microsoft's Azure Blob Storage emulator)
running as a Docker container.
- Efficient reading and writing of files from Azurite, preferably by setting up
the official Java JARs rather than using the Azure SDK for Python.
I understand the Hadoop Azure connector (wasb:// / abfs://) is the
recommended approach for Azure Synapse, but for local dev with Azurite I’m
unsure which JAR(s) or configuration to use within PySpark Jupyter Notebooks.
Given both PySpark Jupyter and Azurite run in separate containers, I need the
connector to support Azurite's URL which is on the same docker internal network
as PySpark. I cannot use localhost or 127.0.0.1.
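To illustrate the gap, here is a minimal sketch (not the connector's actual code) of how an emulator blob endpoint could be derived from a configurable proxy URL instead of a hardcoded 127.0.0.1. The class and method names are hypothetical, `http://azurite` is a made-up container hostname, and the account name `devstoreaccount1` and blob port 10000 are assumed to be the Azurite defaults.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class EmulatorEndpointSketch {
    // Assumed Azurite defaults: development account name and blob port.
    static final String DEV_ACCOUNT = "devstoreaccount1";
    static final int BLOB_PORT = 10000;

    /**
     * Builds the emulator blob endpoint. A null or blank proxy URL falls
     * back to http://127.0.0.1; otherwise the scheme and host are taken
     * from the proxy URL.
     */
    static URI blobEndpoint(String proxyUrl) throws URISyntaxException {
        String scheme = "http";
        String host = "127.0.0.1";
        if (proxyUrl != null && !proxyUrl.trim().isEmpty()) {
            URI proxy = new URI(proxyUrl.trim());
            if (proxy.getScheme() == null || proxy.getHost() == null) {
                throw new URISyntaxException(proxyUrl, "expected scheme://host");
            }
            scheme = proxy.getScheme();
            host = proxy.getHost();
        }
        return new URI(String.format("%s://%s:%d/%s",
            scheme, host, BLOB_PORT, DEV_ACCOUNT));
    }

    public static void main(String[] args) throws URISyntaxException {
        // Default: the loopback address, unreachable from another container.
        System.out.println(blobEndpoint(null));
        // Hypothetical hostname of the Azurite container on the Docker network.
        System.out.println(blobEndpoint("http://azurite"));
    }
}
```

One caveat with this approach: `java.net.URI.getHost()` returns null for hostnames containing underscores, so a container name such as `my_azurite` would be rejected by this sketch.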
Any guidance on the best practice for this scenario would be greatly
appreciated.
**PS. Thanks for your work on these connectors!**
> [ABFS] Support Azurite storage emulator
> ---------------------------------------
>
> Key: HADOOP-19379
> URL: https://issues.apache.org/jira/browse/HADOOP-19379
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs/azure, test
> Affects Versions: 3.4.1
> Reporter: Yan Zhao
> Priority: Minor
> Labels: pull-request-available
>
> In the integration tests, Azurite is started in a test container and must be
> reachable from another container.
> Currently the Azurite emulator URI is fixed at 127.0.0.1, so code running in
> another container cannot reach the Azurite service.
> I therefore want to introduce a new config,
> `fs.azure.storage.emulator.proxy.url`, so that setting
> `fs.azure.storage.emulator.proxy.url=http://{AzuriteIp}` allows access to the
> Azurite service.
>
>
> {code:java}
> private void connectUsingCredentials(String accountName,
>     StorageCredentials credentials, String containerName)
>     throws URISyntaxException, StorageException, AzureException {
>   URI blobEndPoint;
>   if (isStorageEmulatorAccount(accountName)) {
>     isStorageEmulator = true;
>     CloudStorageAccount account =
>         CloudStorageAccount.getDevelopmentStorageAccount();
>     storageInteractionLayer.createBlobClient(account);
>   } else {
>     blobEndPoint = new URI(getHTTPScheme() + "://" + accountName);
>     storageInteractionLayer.createBlobClient(blobEndPoint, credentials);
>   }
>   suppressRetryPolicyInClientIfNeeded();
>   // Capture the container reference for debugging purposes.
>   container = storageInteractionLayer.getContainerReference(containerName);
>   rootDirectory = container.getDirectoryReference("");
>   // Can only create container if using account key credentials
>   canCreateOrModifyContainer =
>       credentials instanceof StorageCredentialsAccountAndKey;
> }
> {code}
>
> {code:java}
> public static CloudStorageAccount getDevelopmentStorageAccount() {
>   try {
>     return getDevelopmentStorageAccount(null);
>   } catch (final URISyntaxException e) {
>     // This won't happen since we know the standard development storage
>     // URI is valid.
>     return null;
>   }
> }
>
> public static CloudStorageAccount getDevelopmentStorageAccount(
>     final URI proxyUri) throws URISyntaxException {
>   String scheme;
>   String host;
>   if (proxyUri == null) {
>     scheme = "http";
>     host = "127.0.0.1";
>   } else {
>     scheme = proxyUri.getScheme();
>     host = proxyUri.getHost();
>   }
>   StorageCredentials credentials = new StorageCredentialsAccountAndKey(
>       DEVSTORE_ACCOUNT_NAME, DEVSTORE_ACCOUNT_KEY);
>   URI blobPrimaryEndpoint = new URI(String.format(
>       DEVELOPMENT_STORAGE_PRIMARY_ENDPOINT_FORMAT, scheme, host,
>       "10000", DEVSTORE_ACCOUNT_NAME));
>   URI queuePrimaryEndpoint = new URI(String.format(
>       DEVELOPMENT_STORAGE_PRIMARY_ENDPOINT_FORMAT, scheme, host,
>       "10001", DEVSTORE_ACCOUNT_NAME));
>   URI tablePrimaryEndpoint = new URI(String.format(
>       DEVELOPMENT_STORAGE_PRIMARY_ENDPOINT_FORMAT, scheme, host,
>       "10002", DEVSTORE_ACCOUNT_NAME));
>   URI blobSecondaryEndpoint = new URI(String.format(
>       DEVELOPMENT_STORAGE_SECONDARY_ENDPOINT_FORMAT, scheme, host,
>       "10000", DEVSTORE_ACCOUNT_NAME));
>   URI queueSecondaryEndpoint = new URI(String.format(
>       DEVELOPMENT_STORAGE_SECONDARY_ENDPOINT_FORMAT, scheme, host,
>       "10001", DEVSTORE_ACCOUNT_NAME));
>   URI tableSecondaryEndpoint = new URI(String.format(
>       DEVELOPMENT_STORAGE_SECONDARY_ENDPOINT_FORMAT, scheme, host,
>       "10002", DEVSTORE_ACCOUNT_NAME));
>   CloudStorageAccount account = new CloudStorageAccount(credentials,
>       new StorageUri(blobPrimaryEndpoint, blobSecondaryEndpoint),
>       new StorageUri(queuePrimaryEndpoint, queueSecondaryEndpoint),
>       new StorageUri(tablePrimaryEndpoint, tableSecondaryEndpoint),
>       null /* fileStorageUri */);
>   account.isDevStoreAccount = true;
>   return account;
> }
> {code}
> The call `CloudStorageAccount.getDevelopmentStorageAccount()` always uses
> 127.0.0.1 as the Azurite host.
>
> However, a proxy URI can be passed in instead by invoking
> `getDevelopmentStorageAccount(final URI proxyUri)`.
>
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)