anmolanmol1234 commented on code in PR #7540: URL: https://github.com/apache/hadoop/pull/7540#discussion_r2026883164
########## hadoop-tools/hadoop-azure/src/site/markdown/index.md: ########## @@ -12,553 +12,1479 @@ limitations under the License. See accompanying LICENSE file. --> -# Hadoop Azure Support: Azure Blob Storage +# Hadoop Azure Support: ABFS - Azure Data Lake Storage Gen2 <!-- MACRO{toc|fromDepth=1|toDepth=3} --> -See also: - -* [WASB](./wasb.html) -* [ABFS](./abfs.html) -* [Namespace Disabled Accounts on ABFS](./fns_blob.html) -* [Testing](./testing_azure.html) - -## Introduction +## <a name="introduction"></a> Introduction -The `hadoop-azure` module provides support for integration with -[Azure Blob Storage](http://azure.microsoft.com/en-us/documentation/services/storage/). -The built jar file, named `hadoop-azure.jar`, also declares transitive dependencies -on the additional artifacts it requires, notably the -[Azure Storage SDK for Java](https://github.com/Azure/azure-storage-java). +The `hadoop-azure` module provides support for the Azure Data Lake Storage Gen2 +storage layer through the "abfs" connector -To make it part of Apache Hadoop's default classpath, simply make sure that -`HADOOP_OPTIONAL_TOOLS`in `hadoop-env.sh` has `'hadoop-azure` in the list. -Example: +To make it part of Apache Hadoop's default classpath, make sure that +`HADOOP_OPTIONAL_TOOLS` environment variable has `hadoop-azure` in the list, +*on every machine in the cluster* ```bash -export HADOOP_OPTIONAL_TOOLS="hadoop-azure,hadoop-azure-datalake" +export HADOOP_OPTIONAL_TOOLS=hadoop-azure ``` -## Features -* Read and write data stored in an Azure Blob Storage account. -* Present a hierarchical file system view by implementing the standard Hadoop +You can set this locally in your `.profile`/`.bashrc`, but note it won't +propagate to jobs running in-cluster. + +See also: +* [FNS (non-HNS)](./fns_blob.html) +* [Legacy-Deprecated-WASB](./wasb.html) +* [Testing](./testing_azure.html) + +## <a name="features"></a> Features of the ABFS connector. + +* Supports reading and writing data stored in an Azure Blob Storage account. +* *Fully Consistent* view of the storage across all clients. +* Can read data written through the ` deprecated wasb:` connector. +* Presents a hierarchical file system view by implementing the standard Hadoop [`FileSystem`](../api/org/apache/hadoop/fs/FileSystem.html) interface. * Supports configuration of multiple Azure Blob Storage accounts. -* Supports both block blobs (suitable for most use cases, such as MapReduce) and - page blobs (suitable for continuous write use cases, such as an HBase - write-ahead log). -* Reference file system paths using URLs using the `wasb` scheme. -* Also reference file system paths using URLs with the `wasbs` scheme for SSL - encrypted access. -* Can act as a source of data in a MapReduce job, or a sink. -* Tested on both Linux and Windows. -* Tested at scale. - -## Limitations - -* File owner and group are persisted, but the permissions model is not enforced. - Authorization occurs at the level of the entire Azure Blob Storage account. -* File last access time is not tracked. +* Can act as a source or destination of data in Hadoop MapReduce, Apache Hive, Apache Spark. +* Tested at scale on both Linux and Windows by Microsoft themselves. +* Can be used as a replacement for HDFS on Hadoop clusters deployed in Azure infrastructure. + +For details on ABFS, consult the following documents: + +* [A closer look at Azure Data Lake Storage Gen2](https://azure.microsoft.com/en-gb/blog/a-closer-look-at-azure-data-lake-storage-gen2/); +MSDN Article from June 28, 2018. 
+* [Storage Tiers](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers) -## Usage +## Getting started ### Concepts -The Azure Blob Storage data model presents 3 core concepts: +The Azure Storage data model presents 3 core concepts: * **Storage Account**: All access is done through a storage account. * **Container**: A container is a grouping of multiple blobs. A storage account may have multiple containers. In Hadoop, an entire file system hierarchy is - stored in a single container. It is also possible to configure multiple - containers, effectively presenting multiple file systems that can be referenced - using distinct URLs. -* **Blob**: A file of any type and size. In Hadoop, files are stored in blobs. - The internal implementation also uses blobs to persist the file system - hierarchy and other metadata. + stored in a single container. +* **Blob**: A file of any type and size stored with the existing wasb connector -### Configuring Credentials +The ABFS connector connects to classic containers, or those created +with Hierarchical Namespaces. -Usage of Azure Blob Storage requires configuration of credentials. Typically -this is set in core-site.xml. The configuration property name is of the form -`fs.azure.account.key.<account name>.blob.core.windows.net` and the value is the -access key. **The access key is a secret that protects access to your storage -account. Do not share the access key (or the core-site.xml file) with an -untrusted party.** +## <a name="namespaces"></a> Hierarchical Namespaces (and WASB Compatibility) -For example: +A key aspect of ADLS Gen 2 is its support for +[hierarchical namespaces](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace) +These are effectively directories and offer high performance rename and delete operations +—something which makes a significant improvement in performance in query engines +writing data to, including MapReduce, Spark, Hive, as well as DistCp. -```xml -<property> - <name>fs.azure.account.key.youraccount.blob.core.windows.net</name> - <value>YOUR ACCESS KEY</value> -</property> -``` -In many Hadoop clusters, the core-site.xml file is world-readable. It is possible to -protect the access key within a credential provider as well. This provides an encrypted -file format along with protection with file permissions. +This feature is only available if the container was created with "namespace" +support. -#### Protecting the Azure Credentials for WASB with Credential Providers +You enable namespace support when creating a new Storage Account, +by checking the "Hierarchical Namespace" option in the Portal UI, or, when +creating through the command line, using the option `--hierarchical-namespace true` -To protect these credentials from prying eyes, it is recommended that you use -the credential provider framework to securely store them and access them -through configuration. The following describes its use for Azure credentials -in WASB FileSystem. +_You cannot enable Hierarchical Namespaces on an existing storage account_ -For additional reading on the credential provider API see: -[Credential Provider API](../hadoop-project-dist/hadoop-common/CredentialProviderAPI.html). 
+_**Containers in a storage account with Hierarchical Namespaces are +not (currently) readable through the `deprecated wasb:` connector.**_ -##### End to End Steps for Distcp and WASB with Credential Providers - -###### provision +Some of the `az storage` command line commands fail too, for example: ```bash -% hadoop credential create fs.azure.account.key.youraccount.blob.core.windows.net -value 123 - -provider localjceks://file/home/lmccay/wasb.jceks +$ az storage container list --account-name abfswales1 +Blob API is not yet supported for hierarchical namespace accounts. ErrorCode: BlobApiNotYetSupportedForHierarchicalNamespaceAccounts ``` -###### configure core-site.xml or command line system property +### <a name="creating"></a> Creating an Azure Storage Account -```xml -<property> - <name>hadoop.security.credential.provider.path</name> - <value>localjceks://file/home/lmccay/wasb.jceks</value> - <description>Path to interrogate for protected credentials.</description> -</property> -``` +The best documentation on getting started with Azure Datalake Gen2 with the +abfs connector is [Using Azure Data Lake Storage Gen2 with Azure HDInsight clusters](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-use-hdi-cluster) + +It includes instructions to create it from [the Azure command line tool](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest), +which can be installed on Windows, MacOS (via Homebrew) and Linux (apt or yum). -###### distcp +The [az storage](https://docs.microsoft.com/en-us/cli/azure/storage?view=azure-cli-latest) subcommand +handles all storage commands, [`az storage account create`](https://docs.microsoft.com/en-us/cli/azure/storage/account?view=azure-cli-latest#az-storage-account-create) +does the creation. +Until the ADLS gen2 API support is finalized, you need to add an extension +to the ADLS command. ```bash -% hadoop distcp - [-D hadoop.security.credential.provider.path=localjceks://file/home/lmccay/wasb.jceks] - hdfs://hostname:9001/user/lmccay/007020615 wasb://yourcontai...@youraccount.blob.core.windows.net/testDir/ +az extension add --name storage-preview ``` -NOTE: You may optionally add the provider path property to the distcp command line instead of -added job specific configuration to a generic core-site.xml. The square brackets above illustrate -this capability. 
+Check that all is well by verifying that the usage command includes `--hierarchical-namespace`: +``` +$ az storage account +usage: az storage account create [-h] [--verbose] [--debug] + [--output {json,jsonc,table,tsv,yaml,none}] + [--query JMESPATH] --resource-group + RESOURCE_GROUP_NAME --name ACCOUNT_NAME + [--sku {Standard_LRS,Standard_GRS,Standard_RAGRS,Standard_ZRS,Premium_LRS,Premium_ZRS}] + [--location LOCATION] + [--kind {Storage,StorageV2,BlobStorage,FileStorage,BlockBlobStorage}] + [--tags [TAGS [TAGS ...]]] + [--custom-domain CUSTOM_DOMAIN] + [--encryption-services {blob,file,table,queue} [{blob,file,table,queue} ...]] + [--access-tier {Hot,Cool}] + [--https-only [{true,false}]] + [--file-aad [{true,false}]] + [--hierarchical-namespace [{true,false}]] + [--bypass {None,Logging,Metrics,AzureServices} [{None,Logging,Metrics,AzureServices} ...]] + [--default-action {Allow,Deny}] + [--assign-identity] + [--subscription _SUBSCRIPTION] +``` -#### Protecting the Azure Credentials for WASB within an Encrypted File +You can list locations from `az account list-locations`, which lists the +name to refer to in the `--location` argument: +``` +$ az account list-locations -o table + +DisplayName Latitude Longitude Name +------------------- ---------- ----------- ------------------ +East Asia 22.267 114.188 eastasia +Southeast Asia 1.283 103.833 southeastasia +Central US 41.5908 -93.6208 centralus +East US 37.3719 -79.8164 eastus +East US 2 36.6681 -78.3889 eastus2 +West US 37.783 -122.417 westus +North Central US 41.8819 -87.6278 northcentralus +South Central US 29.4167 -98.5 southcentralus +North Europe 53.3478 -6.2597 northeurope +West Europe 52.3667 4.9 westeurope +Japan West 34.6939 135.5022 japanwest +Japan East 35.68 139.77 japaneast +Brazil South -23.55 -46.633 brazilsouth +Australia East -33.86 151.2094 australiaeast +Australia Southeast -37.8136 144.9631 australiasoutheast +South India 12.9822 80.1636 southindia +Central India 18.5822 73.9197 centralindia +West India 19.088 72.868 westindia +Canada Central 43.653 -79.383 canadacentral +Canada East 46.817 -71.217 canadaeast +UK South 50.941 -0.799 uksouth +UK West 53.427 -3.084 ukwest +West Central US 40.890 -110.234 westcentralus +West US 2 47.233 -119.852 westus2 +Korea Central 37.5665 126.9780 koreacentral +Korea South 35.1796 129.0756 koreasouth +France Central 46.3772 2.3730 francecentral +France South 43.8345 2.1972 francesouth +Australia Central -35.3075 149.1244 australiacentral +Australia Central 2 -35.3075 149.1244 australiacentral2 +``` -In addition to using the credential provider framework to protect your credentials, it's -also possible to configure it in encrypted form. An additional configuration property -specifies an external program to be invoked by Hadoop processes to decrypt the -key. 
The encrypted key value is passed to this external program as a command -line argument: +Once a location has been chosen, create the account +```bash -```xml -<property> - <name>fs.azure.account.keyprovider.youraccount</name> - <value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value> -</property> +az storage account create --verbose \ + --name abfswales1 \ + --resource-group devteam2 \ + --kind StorageV2 \ + --hierarchical-namespace true \ + --location ukwest \ + --sku Standard_LRS \ + --https-only true \ + --encryption-services blob \ + --access-tier Hot \ + --tags owner=engineering \ + --assign-identity \ + --output jsonc +``` -<property> - <name>fs.azure.account.key.youraccount.blob.core.windows.net</name> - <value>YOUR ENCRYPTED ACCESS KEY</value> -</property> +The output of the command is a JSON file, whose `primaryEndpoints` command +includes the name of the store endpoint: +```json +{ + "primaryEndpoints": { + "blob": "https://abfswales1.blob.core.windows.net/", + "dfs": "https://abfswales1.dfs.core.windows.net/", + "file": "https://abfswales1.file.core.windows.net/", + "queue": "https://abfswales1.queue.core.windows.net/", + "table": "https://abfswales1.table.core.windows.net/", + "web": "https://abfswales1.z35.web.core.windows.net/" + } +} +``` + +The `abfswales1.dfs.core.windows.net` account is the name by which the +storage account will be referred to. + +Now ask for the connection string to the store, which contains the account key +```bash +az storage account show-connection-string --name abfswales1 +{ + "connectionString": "DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net;AccountName=abfswales1;AccountKey=ZGlkIHlvdSByZWFsbHkgdGhpbmsgSSB3YXMgZ29pbmcgdG8gcHV0IGEga2V5IGluIGhlcmU/IA==" +} +``` +You then need to add the access key to your `core-site.xml`, JCEKs file or +use your cluster management tool to set it the option `fs.azure.account.key.STORAGE-ACCOUNT` +to this value. +```XML <property> - <name>fs.azure.shellkeyprovider.script</name> - <value>PATH TO DECRYPTION PROGRAM</value> + <name>fs.azure.account.key.abfswales1.dfs.core.windows.net</name> + <value>ZGlkIHlvdSByZWFsbHkgdGhpbmsgSSB3YXMgZ29pbmcgdG8gcHV0IGEga2V5IGluIGhlcmU/IA==</value> </property> - ``` -### Block Blob with Compaction Support and Configuration +#### Creation through the Azure Portal -Block blobs are the default kind of blob and are good for most big-data use -cases. However, block blobs have strict limit of 50,000 blocks per blob. -To prevent reaching the limit WASB, by default, does not upload new block to -the service after every `hflush()` or `hsync()`. +Creation through the portal is covered in [Quickstart: Create an Azure Data Lake Storage Gen2 storage account](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-quickstart-create-account) -For most of the cases, combining data from multiple `write()` calls in -blocks of 4Mb is a good optimization. But, in others cases, like HBase log files, -every call to `hflush()` or `hsync()` must upload the data to the service. +Key Steps -Block blobs with compaction upload the data to the cloud service after every -`hflush()`/`hsync()`. To mitigate the limit of 50000 blocks, `hflush() -`/`hsync()` runs once compaction process, if number of blocks in the blob -is above 32,000. +1. Create a new Storage Account in a location which suits you. +1. "Basics" Tab: select "StorageV2". +1. "Advanced" Tab: enable "Hierarchical Namespace". -Block compaction search and replaces a sequence of small blocks with one big -block. 
That means there is associated cost with block compaction: reading -small blocks back to the client and writing it again as one big block. +You have now created your storage account. Next, get the key for authentication +for using the default "Shared Key" authentication. -In order to have the files you create be block blobs with block compaction -enabled, the client must set the configuration variable -`fs.azure.block.blob.with.compaction.dir` to a comma-separated list of -folder names. +1. Go to the Azure Portal. +1. Select "Storage Accounts" +1. Select the newly created storage account. +1. In the list of settings, locate "Access Keys" and select that. +1. Copy one of the access keys to the clipboard, add to the XML option, +set in cluster management tools, Hadoop JCEKS file or KMS store. -For example: +### <a name="new_container"></a> Creating a new container -```xml -<property> - <name>fs.azure.block.blob.with.compaction.dir</name> - <value>/hbase/WALs,/data/myblobfiles</value> -</property> -``` +An Azure storage account can have multiple containers, each with the container +name as the userinfo field of the URI used to reference it. -### Page Blob Support and Configuration +For example, the container "container1" in the storage account just created +will have the URL `abfs://contain...@abfswales1.dfs.core.windows.net/` -The Azure Blob Storage interface for Hadoop supports two kinds of blobs, -[block blobs and page blobs](http://msdn.microsoft.com/en-us/library/azure/ee691964.aspx). -Block blobs are the default kind of blob and are good for most big-data use -cases, like input data for Hive, Pig, analytical map-reduce jobs etc. Page blob -handling in hadoop-azure was introduced to support HBase log files. Page blobs -can be written any number of times, whereas block blobs can only be appended to -50,000 times before you run out of blocks and your writes will fail. That won't -work for HBase logs, so page blob support was introduced to overcome this -limitation. -Page blobs can be up to 1TB in size, larger than the maximum 200GB size for block -blobs. -You should stick to block blobs for most usage, and page blobs are only tested in context of HBase write-ahead logs. +You can create a new container through the ABFS connector, by setting the option + `fs.azure.createRemoteFileSystemDuringInitialization` to `true`. Though the + same is not supported when AuthType is SAS. -In order to have the files you create be page blobs, you must set the -configuration variable `fs.azure.page.blob.dir` to a comma-separated list of -folder names. +If the container does not exist, an attempt to list it with `hadoop fs -ls` +will fail -For example: +``` +$ hadoop fs -ls abfs://contain...@abfswales1.dfs.core.windows.net/ -```xml -<property> - <name>fs.azure.page.blob.dir</name> - <value>/hbase/WALs,/hbase/oldWALs,/data/mypageblobfiles</value> -</property> +ls: `abfs://contain...@abfswales1.dfs.core.windows.net/': No such file or directory ``` -You can set this to simply / to make all files page blobs. +Enable remote FS creation and the second attempt succeeds, creating the container as it does so: -The configuration option `fs.azure.page.blob.size` is the default initial -size for a page blob. It must be 128MB or greater, and no more than 1TB, -specified as an integer number of bytes. 
+``` +$ hadoop fs -D fs.azure.createRemoteFileSystemDuringInitialization=true \ + -ls abfs://contain...@abfswales1.dfs.core.windows.net/ +``` -The configuration option `fs.azure.page.blob.extension.size` is the page blob -extension size. This defines the amount to extend a page blob if it starts to -get full. It must be 128MB or greater, specified as an integer number of bytes. +This is useful for creating accounts on the command line, especially before +the `az storage` command supports hierarchical namespaces completely. -### Custom User-Agent -WASB passes User-Agent header to the Azure back-end. The default value -contains WASB version, Java Runtime version, Azure Client library version, and the -value of the configuration option `fs.azure.user.agent.prefix`. Customized User-Agent -header enables better troubleshooting and analysis by Azure service. -```xml -<property> - <name>fs.azure.user.agent.prefix</name> - <value>Identifier</value> -</property> -``` +### Listing and examining containers of a Storage Account. -### Atomic Folder Rename +You can use the [Azure Storage Explorer](https://azure.microsoft.com/en-us/features/storage-explorer/) -Azure storage stores files as a flat key/value store without formal support -for folders. The hadoop-azure file system layer simulates folders on top -of Azure storage. By default, folder rename in the hadoop-azure file system -layer is not atomic. That means that a failure during a folder rename -could, for example, leave some folders in the original directory and -some in the new one. +## <a name="configuring"></a> Configuring ABFS -HBase depends on atomic folder rename. Hence, a configuration setting was -introduced called `fs.azure.atomic.rename.dir` that allows you to specify a -comma-separated list of directories to receive special treatment so that -folder rename is made atomic. The default value of this setting is just -`/hbase`. Redo will be applied to finish a folder rename that fails. A file -`<folderName>-renamePending.json` may appear temporarily and is the record of -the intention of the rename operation, to allow redo in event of a failure. +Any configuration can be specified generally (or as the default when accessing all accounts) +or can be tied to a specific account. +For example, an OAuth identity can be configured for use regardless of which +account is accessed with the property `fs.azure.account.oauth2.client.id` +or you can configure an identity to be used only for a specific storage account with +`fs.azure.account.oauth2.client.id.<account_name>.dfs.core.windows.net`. -For example: +This is shown in the Authentication section. -```xml -<property> - <name>fs.azure.atomic.rename.dir</name> - <value>/hbase,/data</value> -</property> -``` +## <a name="authentication"></a> Authentication -### Accessing wasb URLs +Authentication for ABFS is ultimately granted by [Azure Active Directory](https://docs.microsoft.com/en-us/azure/active-directory/develop/authentication-scenarios). -After credentials are configured in core-site.xml, any Hadoop component may -reference files in that Azure Blob Storage account by using URLs of the following -format: +The concepts covered there are beyond the scope of this document to cover; +developers are expected to have read and understood the concepts therein +to take advantage of the different authentication mechanisms. 
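As a minimal sketch of the general versus account-specific pattern described under "Configuring ABFS" (the client id values below are placeholders, and `abfswales1` is simply the account name used in the earlier examples):

```xml
<!-- Default client id, used for any storage account without an override -->
<property>
  <name>fs.azure.account.oauth2.client.id</name>
  <value>DEFAULT_CLIENT_ID</value>
</property>

<!-- Account-specific override, applied only to abfswales1.dfs.core.windows.net -->
<property>
  <name>fs.azure.account.oauth2.client.id.abfswales1.dfs.core.windows.net</name>
  <value>ABFSWALES1_CLIENT_ID</value>
</property>
```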
- wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path> +What is covered here, briefly, is how to configure the ABFS client to authenticate +in different deployment situations. -The schemes `wasb` and `wasbs` identify a URL on a file system backed by Azure -Blob Storage. `wasb` utilizes unencrypted HTTP access for all interaction with -the Azure Blob Storage API. `wasbs` utilizes SSL encrypted HTTPS access. +The ABFS client can be deployed in different ways, with its authentication needs +driven by them. -For example, the following -[FileSystem Shell](../hadoop-project-dist/hadoop-common/FileSystemShell.html) -commands demonstrate access to a storage account named `youraccount` and a -container named `yourcontainer`. +1. With the storage account's authentication secret in the configuration: "Shared Key". +2. Using OAuth 2.0 tokens of one form or another. +3. Deployed in-Azure with the Azure VMs providing OAuth 2.0 tokens to the application, "Managed Instance". +4. Using Shared Access Signature (SAS) tokens provided by a custom implementation of the SASTokenProvider interface. +5. By directly configuring a fixed Shared Access Signature (SAS) token in the account configuration settings files. -```bash -% hadoop fs -mkdir wasb://yourcontai...@youraccount.blob.core.windows.net/testDir +Note: SAS Based Authentication should be used only with HNS Enabled accounts. -% hadoop fs -put testFile wasb://yourcontai...@youraccount.blob.core.windows.net/testDir/testFile +What can be changed is what secrets/credentials are used to authenticate the caller. -% hadoop fs -cat wasbs://yourcontai...@youraccount.blob.core.windows.net/testDir/testFile -test file content -``` +The authentication mechanism is set in `fs.azure.account.auth.type` (or the +account specific variant). The possible values are SharedKey, OAuth, Custom +and SAS. For the various OAuth options use the config `fs.azure.account.oauth.provider.type`. Following are the implementations supported +ClientCredsTokenProvider, UserPasswordTokenProvider, MsiTokenProvider, +RefreshTokenBasedTokenProvider and WorkloadIdentityTokenProvider. An IllegalArgumentException is thrown if +the specified provider type is not one of the supported. + +All secrets can be stored in JCEKS files. These are encrypted and password +protected —use them or a compatible Hadoop Key Management Store wherever +possible + +### <a name="aad-token-fetch-retry-logic"></a> AAD Token fetch retries -It's also possible to configure `fs.defaultFS` to use a `wasb` or `wasbs` URL. -This causes all bare paths, such as `/testDir/testFile` to resolve automatically -to that file system. +The exponential retry policy used for the AAD token fetch retries can be tuned +with the following configurations. +* `fs.azure.oauth.token.fetch.retry.max.retries`: Sets the maximum number of + retries. Default value is 5. +* `fs.azure.oauth.token.fetch.retry.min.backoff.interval`: Minimum back-off + interval. Added to the retry interval computed from delta backoff. By + default this is set as 0. Set the interval in milli seconds. +* `fs.azure.oauth.token.fetch.retry.max.backoff.interval`: Maximum back-off +interval. Default value is 60000 (sixty seconds). Set the interval in milli +seconds. +* `fs.azure.oauth.token.fetch.retry.delta.backoff`: Back-off interval between +retries. Multiples of this timespan are used for subsequent retry attempts + . The default value is 2. 
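As a minimal sketch of tuning these retries in `core-site.xml` (the values below are arbitrary illustrations, not recommended defaults; the interval settings are in milliseconds):

```xml
<!-- Cap AAD token fetch retries at 3 attempts instead of the default 5 -->
<property>
  <name>fs.azure.oauth.token.fetch.retry.max.retries</name>
  <value>3</value>
</property>
<!-- Wait at least 500 ms between retry attempts -->
<property>
  <name>fs.azure.oauth.token.fetch.retry.min.backoff.interval</name>
  <value>500</value>
</property>
<!-- Never back off for more than 30 seconds -->
<property>
  <name>fs.azure.oauth.token.fetch.retry.max.backoff.interval</name>
  <value>30000</value>
</property>
<!-- Delta used when computing the exponential back-off -->
<property>
  <name>fs.azure.oauth.token.fetch.retry.delta.backoff</name>
  <value>2</value>
</property>
```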
-### Append API Support and Configuration +### <a name="shared-key-auth"></a> Default: Shared Key -The Azure Blob Storage interface for Hadoop has optional support for Append API for -single writer by setting the configuration `fs.azure.enable.append.support` to true. +This is the simplest authentication mechanism of account + password. -For Example: +The account name is inferred from the URL; +the password, "key", retrieved from the XML/JCECKs configuration files. ```xml <property> - <name>fs.azure.enable.append.support</name> - <value>true</value> + <name>fs.azure.account.auth.type.ACCOUNT_NAME.dfs.core.windows.net</name> + <value>SharedKey</value> + <description> + </description> +</property> +<property> + <name>fs.azure.account.key.ACCOUNT_NAME.dfs.core.windows.net</name> + <value>ACCOUNT_KEY</value> + <description> + The secret password. Never share these. + </description> </property> ``` -It must be noted Append support in Azure Blob Storage interface DIFFERS FROM HDFS SEMANTICS. Append -support does not enforce single writer internally but requires applications to guarantee this semantic. -It becomes a responsibility of the application either to ensure single-threaded handling for a particular -file path, or rely on some external locking mechanism of its own. Failure to do so will result in -unexpected behavior. +*Note*: The source of the account key can be changed through a custom key provider; +one exists to execute a shell script to retrieve it. -### Multithread Support +A custom key provider class can be provided with the config +`fs.azure.account.keyprovider`. If a key provider class is specified the same +will be used to get account key. Otherwise the Simple key provider will be used +which will use the key specified for the config `fs.azure.account.key`. -Rename and Delete blob operations on directories with large number of files and sub directories currently is very slow as these operations are done one blob at a time serially. These files and sub folders can be deleted or renamed parallel. Following configurations can be used to enable threads to do parallel processing +To retrieve using shell script, specify the path to the script for the config +`fs.azure.shellkeyprovider.script`. ShellDecryptionKeyProvider class use the +script specified to retrieve the key. -To enable 10 threads for Delete operation. Set configuration value to 0 or 1 to disable threads. The default behavior is threads disabled. +### <a name="oauth-client-credentials"></a> OAuth 2.0 Client Credentials -```xml -<property> - <name>fs.azure.delete.threads</name> - <value>10</value> -</property> -``` +OAuth 2.0 credentials of (client id, client secret, endpoint) are provided in the configuration/JCEKS file. -To enable 20 threads for Rename operation. Set configuration value to 0 or 1 to disable threads. The default behavior is threads disabled. +The specifics of this process is covered +in [hadoop-azure-datalake](../hadoop-azure-datalake/index.html#Configuring_Credentials_and_FileSystem); +the key names are slightly different here. 
```xml
<property>
-  <name>fs.azure.rename.threads</name>
-  <value>20</value>
+  <name>fs.azure.account.auth.type</name>
+  <value>OAuth</value>
+  <description>
+  Use OAuth authentication
+  </description>
+</property>
+<property>
+  <name>fs.azure.account.oauth.provider.type</name>
+  <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>
+  <description>
+  Use client credentials
+  </description>
+</property>
+<property>
+  <name>fs.azure.account.oauth2.client.endpoint</name>
+  <value></value>

Review Comment:
   Should we add the expected format here, or an example endpoint value, instead of leaving it empty?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org