anmolanmol1234 commented on code in PR #7540: URL: https://github.com/apache/hadoop/pull/7540#discussion_r2026883164
########## hadoop-tools/hadoop-azure/src/site/markdown/index.md: ########## @@ -12,553 +12,1479 @@ limitations under the License. See accompanying LICENSE file. --> -# Hadoop Azure Support: Azure Blob Storage +# Hadoop Azure Support: ABFS - Azure Data Lake Storage Gen2 <!-- MACRO{toc|fromDepth=1|toDepth=3} --> -See also: - -* [WASB](./wasb.html) -* [ABFS](./abfs.html) -* [Namespace Disabled Accounts on ABFS](./fns_blob.html) -* [Testing](./testing_azure.html) - -## Introduction +## <a name="introduction"></a> Introduction -The `hadoop-azure` module provides support for integration with -[Azure Blob Storage](http://azure.microsoft.com/en-us/documentation/services/storage/). -The built jar file, named `hadoop-azure.jar`, also declares transitive dependencies -on the additional artifacts it requires, notably the -[Azure Storage SDK for Java](https://github.com/Azure/azure-storage-java). +The `hadoop-azure` module provides support for the Azure Data Lake Storage Gen2 +storage layer through the "abfs" connector -To make it part of Apache Hadoop's default classpath, simply make sure that -`HADOOP_OPTIONAL_TOOLS`in `hadoop-env.sh` has `'hadoop-azure` in the list. -Example: +To make it part of Apache Hadoop's default classpath, make sure that +`HADOOP_OPTIONAL_TOOLS` environment variable has `hadoop-azure` in the list, +*on every machine in the cluster* ```bash -export HADOOP_OPTIONAL_TOOLS="hadoop-azure,hadoop-azure-datalake" +export HADOOP_OPTIONAL_TOOLS=hadoop-azure ``` -## Features -* Read and write data stored in an Azure Blob Storage account. -* Present a hierarchical file system view by implementing the standard Hadoop +You can set this locally in your `.profile`/`.bashrc`, but note it won't +propagate to jobs running in-cluster. + +See also: +* [FNS (non-HNS)](./fns_blob.html) +* [Legacy-Deprecated-WASB](./wasb.html) +* [Testing](./testing_azure.html) + +## <a name="features"></a> Features of the ABFS connector. + +* Supports reading and writing data stored in an Azure Blob Storage account. +* *Fully Consistent* view of the storage across all clients. +* Can read data written through the ` deprecated wasb:` connector. +* Presents a hierarchical file system view by implementing the standard Hadoop [`FileSystem`](../api/org/apache/hadoop/fs/FileSystem.html) interface. * Supports configuration of multiple Azure Blob Storage accounts. -* Supports both block blobs (suitable for most use cases, such as MapReduce) and - page blobs (suitable for continuous write use cases, such as an HBase - write-ahead log). -* Reference file system paths using URLs using the `wasb` scheme. -* Also reference file system paths using URLs with the `wasbs` scheme for SSL - encrypted access. -* Can act as a source of data in a MapReduce job, or a sink. -* Tested on both Linux and Windows. -* Tested at scale. - -## Limitations - -* File owner and group are persisted, but the permissions model is not enforced. - Authorization occurs at the level of the entire Azure Blob Storage account. -* File last access time is not tracked. +* Can act as a source or destination of data in Hadoop MapReduce, Apache Hive, Apache Spark. +* Tested at scale on both Linux and Windows by Microsoft themselves. +* Can be used as a replacement for HDFS on Hadoop clusters deployed in Azure infrastructure. + +For details on ABFS, consult the following documents: + +* [A closer look at Azure Data Lake Storage Gen2](https://azure.microsoft.com/en-gb/blog/a-closer-look-at-azure-data-lake-storage-gen2/); +MSDN Article from June 28, 2018. 
+* [Storage Tiers](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers) -## Usage +## Getting started ### Concepts -The Azure Blob Storage data model presents 3 core concepts: +The Azure Storage data model presents 3 core concepts: * **Storage Account**: All access is done through a storage account. * **Container**: A container is a grouping of multiple blobs. A storage account may have multiple containers. In Hadoop, an entire file system hierarchy is - stored in a single container. It is also possible to configure multiple - containers, effectively presenting multiple file systems that can be referenced - using distinct URLs. -* **Blob**: A file of any type and size. In Hadoop, files are stored in blobs. - The internal implementation also uses blobs to persist the file system - hierarchy and other metadata. + stored in a single container. +* **Blob**: A file of any type and size stored with the existing wasb connector -### Configuring Credentials +The ABFS connector connects to classic containers, or those created +with Hierarchical Namespaces. -Usage of Azure Blob Storage requires configuration of credentials. Typically -this is set in core-site.xml. The configuration property name is of the form -`fs.azure.account.key.<account name>.blob.core.windows.net` and the value is the -access key. **The access key is a secret that protects access to your storage -account. Do not share the access key (or the core-site.xml file) with an -untrusted party.** +## <a name="namespaces"></a> Hierarchical Namespaces (and WASB Compatibility) -For example: +A key aspect of ADLS Gen 2 is its support for +[hierarchical namespaces](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace) +These are effectively directories and offer high performance rename and delete operations +—something which makes a significant improvement in performance in query engines +writing data to, including MapReduce, Spark, Hive, as well as DistCp. -```xml -<property> - <name>fs.azure.account.key.youraccount.blob.core.windows.net</name> - <value>YOUR ACCESS KEY</value> -</property> -``` -In many Hadoop clusters, the core-site.xml file is world-readable. It is possible to -protect the access key within a credential provider as well. This provides an encrypted -file format along with protection with file permissions. +This feature is only available if the container was created with "namespace" +support. -#### Protecting the Azure Credentials for WASB with Credential Providers +You enable namespace support when creating a new Storage Account, +by checking the "Hierarchical Namespace" option in the Portal UI, or, when +creating through the command line, using the option `--hierarchical-namespace true` -To protect these credentials from prying eyes, it is recommended that you use -the credential provider framework to securely store them and access them -through configuration. The following describes its use for Azure credentials -in WASB FileSystem. +_You cannot enable Hierarchical Namespaces on an existing storage account_ -For additional reading on the credential provider API see: -[Credential Provider API](../hadoop-project-dist/hadoop-common/CredentialProviderAPI.html). 
+_**Containers in a storage account with Hierarchical Namespaces are +not (currently) readable through the `deprecated wasb:` connector.**_ -##### End to End Steps for Distcp and WASB with Credential Providers - -###### provision +Some of the `az storage` command line commands fail too, for example: ```bash -% hadoop credential create fs.azure.account.key.youraccount.blob.core.windows.net -value 123 - -provider localjceks://file/home/lmccay/wasb.jceks +$ az storage container list --account-name abfswales1 +Blob API is not yet supported for hierarchical namespace accounts. ErrorCode: BlobApiNotYetSupportedForHierarchicalNamespaceAccounts ``` -###### configure core-site.xml or command line system property +### <a name="creating"></a> Creating an Azure Storage Account -```xml -<property> - <name>hadoop.security.credential.provider.path</name> - <value>localjceks://file/home/lmccay/wasb.jceks</value> - <description>Path to interrogate for protected credentials.</description> -</property> -``` +The best documentation on getting started with Azure Datalake Gen2 with the +abfs connector is [Using Azure Data Lake Storage Gen2 with Azure HDInsight clusters](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-use-hdi-cluster) + +It includes instructions to create it from [the Azure command line tool](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest), +which can be installed on Windows, MacOS (via Homebrew) and Linux (apt or yum). -###### distcp +The [az storage](https://docs.microsoft.com/en-us/cli/azure/storage?view=azure-cli-latest) subcommand +handles all storage commands, [`az storage account create`](https://docs.microsoft.com/en-us/cli/azure/storage/account?view=azure-cli-latest#az-storage-account-create) +does the creation. +Until the ADLS gen2 API support is finalized, you need to add an extension +to the ADLS command. ```bash -% hadoop distcp - [-D hadoop.security.credential.provider.path=localjceks://file/home/lmccay/wasb.jceks] - hdfs://hostname:9001/user/lmccay/007020615 wasb://yourcontai...@youraccount.blob.core.windows.net/testDir/ +az extension add --name storage-preview ``` -NOTE: You may optionally add the provider path property to the distcp command line instead of -added job specific configuration to a generic core-site.xml. The square brackets above illustrate -this capability. 
+Check that all is well by verifying that the usage command includes `--hierarchical-namespace`: +``` +$ az storage account +usage: az storage account create [-h] [--verbose] [--debug] + [--output {json,jsonc,table,tsv,yaml,none}] + [--query JMESPATH] --resource-group + RESOURCE_GROUP_NAME --name ACCOUNT_NAME + [--sku {Standard_LRS,Standard_GRS,Standard_RAGRS,Standard_ZRS,Premium_LRS,Premium_ZRS}] + [--location LOCATION] + [--kind {Storage,StorageV2,BlobStorage,FileStorage,BlockBlobStorage}] + [--tags [TAGS [TAGS ...]]] + [--custom-domain CUSTOM_DOMAIN] + [--encryption-services {blob,file,table,queue} [{blob,file,table,queue} ...]] + [--access-tier {Hot,Cool}] + [--https-only [{true,false}]] + [--file-aad [{true,false}]] + [--hierarchical-namespace [{true,false}]] + [--bypass {None,Logging,Metrics,AzureServices} [{None,Logging,Metrics,AzureServices} ...]] + [--default-action {Allow,Deny}] + [--assign-identity] + [--subscription _SUBSCRIPTION] +``` -#### Protecting the Azure Credentials for WASB within an Encrypted File +You can list locations from `az account list-locations`, which lists the +name to refer to in the `--location` argument: +``` +$ az account list-locations -o table + +DisplayName Latitude Longitude Name +------------------- ---------- ----------- ------------------ +East Asia 22.267 114.188 eastasia +Southeast Asia 1.283 103.833 southeastasia +Central US 41.5908 -93.6208 centralus +East US 37.3719 -79.8164 eastus +East US 2 36.6681 -78.3889 eastus2 +West US 37.783 -122.417 westus +North Central US 41.8819 -87.6278 northcentralus +South Central US 29.4167 -98.5 southcentralus +North Europe 53.3478 -6.2597 northeurope +West Europe 52.3667 4.9 westeurope +Japan West 34.6939 135.5022 japanwest +Japan East 35.68 139.77 japaneast +Brazil South -23.55 -46.633 brazilsouth +Australia East -33.86 151.2094 australiaeast +Australia Southeast -37.8136 144.9631 australiasoutheast +South India 12.9822 80.1636 southindia +Central India 18.5822 73.9197 centralindia +West India 19.088 72.868 westindia +Canada Central 43.653 -79.383 canadacentral +Canada East 46.817 -71.217 canadaeast +UK South 50.941 -0.799 uksouth +UK West 53.427 -3.084 ukwest +West Central US 40.890 -110.234 westcentralus +West US 2 47.233 -119.852 westus2 +Korea Central 37.5665 126.9780 koreacentral +Korea South 35.1796 129.0756 koreasouth +France Central 46.3772 2.3730 francecentral +France South 43.8345 2.1972 francesouth +Australia Central -35.3075 149.1244 australiacentral +Australia Central 2 -35.3075 149.1244 australiacentral2 +``` -In addition to using the credential provider framework to protect your credentials, it's -also possible to configure it in encrypted form. An additional configuration property -specifies an external program to be invoked by Hadoop processes to decrypt the -key. 
The encrypted key value is passed to this external program as a command -line argument: +Once a location has been chosen, create the account +```bash -```xml -<property> - <name>fs.azure.account.keyprovider.youraccount</name> - <value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value> -</property> +az storage account create --verbose \ + --name abfswales1 \ + --resource-group devteam2 \ + --kind StorageV2 \ + --hierarchical-namespace true \ + --location ukwest \ + --sku Standard_LRS \ + --https-only true \ + --encryption-services blob \ + --access-tier Hot \ + --tags owner=engineering \ + --assign-identity \ + --output jsonc +``` -<property> - <name>fs.azure.account.key.youraccount.blob.core.windows.net</name> - <value>YOUR ENCRYPTED ACCESS KEY</value> -</property> +The output of the command is a JSON file, whose `primaryEndpoints` command +includes the name of the store endpoint: +```json +{ + "primaryEndpoints": { + "blob": "https://abfswales1.blob.core.windows.net/", + "dfs": "https://abfswales1.dfs.core.windows.net/", + "file": "https://abfswales1.file.core.windows.net/", + "queue": "https://abfswales1.queue.core.windows.net/", + "table": "https://abfswales1.table.core.windows.net/", + "web": "https://abfswales1.z35.web.core.windows.net/" + } +} +``` + +The `abfswales1.dfs.core.windows.net` account is the name by which the +storage account will be referred to. + +Now ask for the connection string to the store, which contains the account key +```bash +az storage account show-connection-string --name abfswales1 +{ + "connectionString": "DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net;AccountName=abfswales1;AccountKey=ZGlkIHlvdSByZWFsbHkgdGhpbmsgSSB3YXMgZ29pbmcgdG8gcHV0IGEga2V5IGluIGhlcmU/IA==" +} +``` +You then need to add the access key to your `core-site.xml`, JCEKs file or +use your cluster management tool to set it the option `fs.azure.account.key.STORAGE-ACCOUNT` +to this value. +```XML <property> - <name>fs.azure.shellkeyprovider.script</name> - <value>PATH TO DECRYPTION PROGRAM</value> + <name>fs.azure.account.key.abfswales1.dfs.core.windows.net</name> + <value>ZGlkIHlvdSByZWFsbHkgdGhpbmsgSSB3YXMgZ29pbmcgdG8gcHV0IGEga2V5IGluIGhlcmU/IA==</value> </property> - ``` -### Block Blob with Compaction Support and Configuration +#### Creation through the Azure Portal -Block blobs are the default kind of blob and are good for most big-data use -cases. However, block blobs have strict limit of 50,000 blocks per blob. -To prevent reaching the limit WASB, by default, does not upload new block to -the service after every `hflush()` or `hsync()`. +Creation through the portal is covered in [Quickstart: Create an Azure Data Lake Storage Gen2 storage account](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-quickstart-create-account) -For most of the cases, combining data from multiple `write()` calls in -blocks of 4Mb is a good optimization. But, in others cases, like HBase log files, -every call to `hflush()` or `hsync()` must upload the data to the service. +Key Steps -Block blobs with compaction upload the data to the cloud service after every -`hflush()`/`hsync()`. To mitigate the limit of 50000 blocks, `hflush() -`/`hsync()` runs once compaction process, if number of blocks in the blob -is above 32,000. +1. Create a new Storage Account in a location which suits you. +1. "Basics" Tab: select "StorageV2". +1. "Advanced" Tab: enable "Hierarchical Namespace". -Block compaction search and replaces a sequence of small blocks with one big -block. 
That means there is associated cost with block compaction: reading -small blocks back to the client and writing it again as one big block. +You have now created your storage account. Next, get the key for authentication +for using the default "Shared Key" authentication. -In order to have the files you create be block blobs with block compaction -enabled, the client must set the configuration variable -`fs.azure.block.blob.with.compaction.dir` to a comma-separated list of -folder names. +1. Go to the Azure Portal. +1. Select "Storage Accounts" +1. Select the newly created storage account. +1. In the list of settings, locate "Access Keys" and select that. +1. Copy one of the access keys to the clipboard, add to the XML option, +set in cluster management tools, Hadoop JCEKS file or KMS store. -For example: +### <a name="new_container"></a> Creating a new container -```xml -<property> - <name>fs.azure.block.blob.with.compaction.dir</name> - <value>/hbase/WALs,/data/myblobfiles</value> -</property> -``` +An Azure storage account can have multiple containers, each with the container +name as the userinfo field of the URI used to reference it. -### Page Blob Support and Configuration +For example, the container "container1" in the storage account just created +will have the URL `abfs://contain...@abfswales1.dfs.core.windows.net/` -The Azure Blob Storage interface for Hadoop supports two kinds of blobs, -[block blobs and page blobs](http://msdn.microsoft.com/en-us/library/azure/ee691964.aspx). -Block blobs are the default kind of blob and are good for most big-data use -cases, like input data for Hive, Pig, analytical map-reduce jobs etc. Page blob -handling in hadoop-azure was introduced to support HBase log files. Page blobs -can be written any number of times, whereas block blobs can only be appended to -50,000 times before you run out of blocks and your writes will fail. That won't -work for HBase logs, so page blob support was introduced to overcome this -limitation. -Page blobs can be up to 1TB in size, larger than the maximum 200GB size for block -blobs. -You should stick to block blobs for most usage, and page blobs are only tested in context of HBase write-ahead logs. +You can create a new container through the ABFS connector, by setting the option + `fs.azure.createRemoteFileSystemDuringInitialization` to `true`. Though the + same is not supported when AuthType is SAS. -In order to have the files you create be page blobs, you must set the -configuration variable `fs.azure.page.blob.dir` to a comma-separated list of -folder names. +If the container does not exist, an attempt to list it with `hadoop fs -ls` +will fail -For example: +``` +$ hadoop fs -ls abfs://contain...@abfswales1.dfs.core.windows.net/ -```xml -<property> - <name>fs.azure.page.blob.dir</name> - <value>/hbase/WALs,/hbase/oldWALs,/data/mypageblobfiles</value> -</property> +ls: `abfs://contain...@abfswales1.dfs.core.windows.net/': No such file or directory ``` -You can set this to simply / to make all files page blobs. +Enable remote FS creation and the second attempt succeeds, creating the container as it does so: -The configuration option `fs.azure.page.blob.size` is the default initial -size for a page blob. It must be 128MB or greater, and no more than 1TB, -specified as an integer number of bytes. 
+``` +$ hadoop fs -D fs.azure.createRemoteFileSystemDuringInitialization=true \ + -ls abfs://contain...@abfswales1.dfs.core.windows.net/ +``` -The configuration option `fs.azure.page.blob.extension.size` is the page blob -extension size. This defines the amount to extend a page blob if it starts to -get full. It must be 128MB or greater, specified as an integer number of bytes. +This is useful for creating accounts on the command line, especially before +the `az storage` command supports hierarchical namespaces completely. -### Custom User-Agent -WASB passes User-Agent header to the Azure back-end. The default value -contains WASB version, Java Runtime version, Azure Client library version, and the -value of the configuration option `fs.azure.user.agent.prefix`. Customized User-Agent -header enables better troubleshooting and analysis by Azure service. -```xml -<property> - <name>fs.azure.user.agent.prefix</name> - <value>Identifier</value> -</property> -``` +### Listing and examining containers of a Storage Account. -### Atomic Folder Rename +You can use the [Azure Storage Explorer](https://azure.microsoft.com/en-us/features/storage-explorer/) -Azure storage stores files as a flat key/value store without formal support -for folders. The hadoop-azure file system layer simulates folders on top -of Azure storage. By default, folder rename in the hadoop-azure file system -layer is not atomic. That means that a failure during a folder rename -could, for example, leave some folders in the original directory and -some in the new one. +## <a name="configuring"></a> Configuring ABFS -HBase depends on atomic folder rename. Hence, a configuration setting was -introduced called `fs.azure.atomic.rename.dir` that allows you to specify a -comma-separated list of directories to receive special treatment so that -folder rename is made atomic. The default value of this setting is just -`/hbase`. Redo will be applied to finish a folder rename that fails. A file -`<folderName>-renamePending.json` may appear temporarily and is the record of -the intention of the rename operation, to allow redo in event of a failure. +Any configuration can be specified generally (or as the default when accessing all accounts) +or can be tied to a specific account. +For example, an OAuth identity can be configured for use regardless of which +account is accessed with the property `fs.azure.account.oauth2.client.id` +or you can configure an identity to be used only for a specific storage account with +`fs.azure.account.oauth2.client.id.<account_name>.dfs.core.windows.net`. -For example: +This is shown in the Authentication section. -```xml -<property> - <name>fs.azure.atomic.rename.dir</name> - <value>/hbase,/data</value> -</property> -``` +## <a name="authentication"></a> Authentication -### Accessing wasb URLs +Authentication for ABFS is ultimately granted by [Azure Active Directory](https://docs.microsoft.com/en-us/azure/active-directory/develop/authentication-scenarios). -After credentials are configured in core-site.xml, any Hadoop component may -reference files in that Azure Blob Storage account by using URLs of the following -format: +The concepts covered there are beyond the scope of this document to cover; +developers are expected to have read and understood the concepts therein +to take advantage of the different authentication mechanisms. 
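As a minimal sketch of the general versus account-specific pattern described under "Configuring ABFS" (the client id values below are placeholders, and `abfswales1` is simply the account name used in the earlier examples):

```xml
<!-- Default client id, used for any storage account without an override -->
<property>
  <name>fs.azure.account.oauth2.client.id</name>
  <value>DEFAULT_CLIENT_ID</value>
</property>

<!-- Account-specific override, applied only to abfswales1.dfs.core.windows.net -->
<property>
  <name>fs.azure.account.oauth2.client.id.abfswales1.dfs.core.windows.net</name>
  <value>ABFSWALES1_CLIENT_ID</value>
</property>
```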
- wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path> +What is covered here, briefly, is how to configure the ABFS client to authenticate +in different deployment situations. -The schemes `wasb` and `wasbs` identify a URL on a file system backed by Azure -Blob Storage. `wasb` utilizes unencrypted HTTP access for all interaction with -the Azure Blob Storage API. `wasbs` utilizes SSL encrypted HTTPS access. +The ABFS client can be deployed in different ways, with its authentication needs +driven by them. -For example, the following -[FileSystem Shell](../hadoop-project-dist/hadoop-common/FileSystemShell.html) -commands demonstrate access to a storage account named `youraccount` and a -container named `yourcontainer`. +1. With the storage account's authentication secret in the configuration: "Shared Key". +2. Using OAuth 2.0 tokens of one form or another. +3. Deployed in-Azure with the Azure VMs providing OAuth 2.0 tokens to the application, "Managed Instance". +4. Using Shared Access Signature (SAS) tokens provided by a custom implementation of the SASTokenProvider interface. +5. By directly configuring a fixed Shared Access Signature (SAS) token in the account configuration settings files. -```bash -% hadoop fs -mkdir wasb://yourcontai...@youraccount.blob.core.windows.net/testDir +Note: SAS Based Authentication should be used only with HNS Enabled accounts. -% hadoop fs -put testFile wasb://yourcontai...@youraccount.blob.core.windows.net/testDir/testFile +What can be changed is what secrets/credentials are used to authenticate the caller. -% hadoop fs -cat wasbs://yourcontai...@youraccount.blob.core.windows.net/testDir/testFile -test file content -``` +The authentication mechanism is set in `fs.azure.account.auth.type` (or the +account specific variant). The possible values are SharedKey, OAuth, Custom +and SAS. For the various OAuth options use the config `fs.azure.account.oauth.provider.type`. Following are the implementations supported +ClientCredsTokenProvider, UserPasswordTokenProvider, MsiTokenProvider, +RefreshTokenBasedTokenProvider and WorkloadIdentityTokenProvider. An IllegalArgumentException is thrown if +the specified provider type is not one of the supported. + +All secrets can be stored in JCEKS files. These are encrypted and password +protected —use them or a compatible Hadoop Key Management Store wherever +possible + +### <a name="aad-token-fetch-retry-logic"></a> AAD Token fetch retries -It's also possible to configure `fs.defaultFS` to use a `wasb` or `wasbs` URL. -This causes all bare paths, such as `/testDir/testFile` to resolve automatically -to that file system. +The exponential retry policy used for the AAD token fetch retries can be tuned +with the following configurations. +* `fs.azure.oauth.token.fetch.retry.max.retries`: Sets the maximum number of + retries. Default value is 5. +* `fs.azure.oauth.token.fetch.retry.min.backoff.interval`: Minimum back-off + interval. Added to the retry interval computed from delta backoff. By + default this is set as 0. Set the interval in milli seconds. +* `fs.azure.oauth.token.fetch.retry.max.backoff.interval`: Maximum back-off +interval. Default value is 60000 (sixty seconds). Set the interval in milli +seconds. +* `fs.azure.oauth.token.fetch.retry.delta.backoff`: Back-off interval between +retries. Multiples of this timespan are used for subsequent retry attempts + . The default value is 2. 
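As a minimal sketch of tuning these retries in `core-site.xml` (the values below are arbitrary illustrations, not recommended defaults; the interval settings are in milliseconds):

```xml
<!-- Cap AAD token fetch retries at 3 attempts instead of the default 5 -->
<property>
  <name>fs.azure.oauth.token.fetch.retry.max.retries</name>
  <value>3</value>
</property>
<!-- Wait at least 500 ms between retry attempts -->
<property>
  <name>fs.azure.oauth.token.fetch.retry.min.backoff.interval</name>
  <value>500</value>
</property>
<!-- Never back off for more than 30 seconds -->
<property>
  <name>fs.azure.oauth.token.fetch.retry.max.backoff.interval</name>
  <value>30000</value>
</property>
<!-- Delta used when computing the exponential back-off -->
<property>
  <name>fs.azure.oauth.token.fetch.retry.delta.backoff</name>
  <value>2</value>
</property>
```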
-### Append API Support and Configuration +### <a name="shared-key-auth"></a> Default: Shared Key -The Azure Blob Storage interface for Hadoop has optional support for Append API for -single writer by setting the configuration `fs.azure.enable.append.support` to true. +This is the simplest authentication mechanism of account + password. -For Example: +The account name is inferred from the URL; +the password, "key", retrieved from the XML/JCECKs configuration files. ```xml <property> - <name>fs.azure.enable.append.support</name> - <value>true</value> + <name>fs.azure.account.auth.type.ACCOUNT_NAME.dfs.core.windows.net</name> + <value>SharedKey</value> + <description> + </description> +</property> +<property> + <name>fs.azure.account.key.ACCOUNT_NAME.dfs.core.windows.net</name> + <value>ACCOUNT_KEY</value> + <description> + The secret password. Never share these. + </description> </property> ``` -It must be noted Append support in Azure Blob Storage interface DIFFERS FROM HDFS SEMANTICS. Append -support does not enforce single writer internally but requires applications to guarantee this semantic. -It becomes a responsibility of the application either to ensure single-threaded handling for a particular -file path, or rely on some external locking mechanism of its own. Failure to do so will result in -unexpected behavior. +*Note*: The source of the account key can be changed through a custom key provider; +one exists to execute a shell script to retrieve it. -### Multithread Support +A custom key provider class can be provided with the config +`fs.azure.account.keyprovider`. If a key provider class is specified the same +will be used to get account key. Otherwise the Simple key provider will be used +which will use the key specified for the config `fs.azure.account.key`. -Rename and Delete blob operations on directories with large number of files and sub directories currently is very slow as these operations are done one blob at a time serially. These files and sub folders can be deleted or renamed parallel. Following configurations can be used to enable threads to do parallel processing +To retrieve using shell script, specify the path to the script for the config +`fs.azure.shellkeyprovider.script`. ShellDecryptionKeyProvider class use the +script specified to retrieve the key. -To enable 10 threads for Delete operation. Set configuration value to 0 or 1 to disable threads. The default behavior is threads disabled. +### <a name="oauth-client-credentials"></a> OAuth 2.0 Client Credentials -```xml -<property> - <name>fs.azure.delete.threads</name> - <value>10</value> -</property> -``` +OAuth 2.0 credentials of (client id, client secret, endpoint) are provided in the configuration/JCEKS file. -To enable 20 threads for Rename operation. Set configuration value to 0 or 1 to disable threads. The default behavior is threads disabled. +The specifics of this process is covered +in [hadoop-azure-datalake](../hadoop-azure-datalake/index.html#Configuring_Credentials_and_FileSystem); +the key names are slightly different here. 
```xml
<property>
-  <name>fs.azure.rename.threads</name>
-  <value>20</value>
+  <name>fs.azure.account.auth.type</name>
+  <value>OAuth</value>
+  <description>
+  Use OAuth authentication
+  </description>
+</property>
+<property>
+  <name>fs.azure.account.oauth.provider.type</name>
+  <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>
+  <description>
+  Use client credentials
+  </description>
+</property>
+<property>
+  <name>fs.azure.account.oauth2.client.endpoint</name>
+  <value></value>

Review Comment:
   Should we add the expected format here, or an example endpoint value, instead of leaving it empty?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org