manika137 commented on code in PR #7540:
URL: https://github.com/apache/hadoop/pull/7540#discussion_r2026725478


##########
hadoop-tools/hadoop-azure/src/site/markdown/index.md:
##########
@@ -12,553 +12,1479 @@
   limitations under the License. See accompanying LICENSE file.
 -->
 
-# Hadoop Azure Support: Azure Blob Storage
+# Hadoop Azure Support: ABFS - Azure Data Lake Storage Gen2
 
 <!-- MACRO{toc|fromDepth=1|toDepth=3} -->
 
-See also:
-
-* [WASB](./wasb.html)
-* [ABFS](./abfs.html)
-* [Namespace Disabled Accounts on ABFS](./fns_blob.html)
-* [Testing](./testing_azure.html)
-
-## Introduction
+## <a name="introduction"></a> Introduction
 
-The `hadoop-azure` module provides support for integration with
-[Azure Blob 
Storage](http://azure.microsoft.com/en-us/documentation/services/storage/).
-The built jar file, named `hadoop-azure.jar`, also declares transitive 
dependencies
-on the additional artifacts it requires, notably the
-[Azure Storage SDK for Java](https://github.com/Azure/azure-storage-java).
+The `hadoop-azure` module provides support for the Azure Data Lake Storage Gen2
+storage layer through the "abfs" connector.
 
-To make it part of Apache Hadoop's default classpath, simply make sure that
-`HADOOP_OPTIONAL_TOOLS`in `hadoop-env.sh` has `'hadoop-azure` in the list.
-Example:
+To make it part of Apache Hadoop's default classpath, make sure that
+the `HADOOP_OPTIONAL_TOOLS` environment variable has `hadoop-azure` in the list,
+*on every machine in the cluster*.
 
 ```bash
-export HADOOP_OPTIONAL_TOOLS="hadoop-azure,hadoop-azure-datalake"
+export HADOOP_OPTIONAL_TOOLS=hadoop-azure
 ```
-## Features
 
-* Read and write data stored in an Azure Blob Storage account.
-* Present a hierarchical file system view by implementing the standard Hadoop
+You can set this locally in your `.profile`/`.bashrc`, but note it won't
+propagate to jobs running in-cluster.
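+
+If the setting needs to apply cluster-wide, one option (a sketch; the file lives under
+`etc/hadoop/` in a default install) is to export it in `hadoop-env.sh` on every node:
+
+```bash
+# etc/hadoop/hadoop-env.sh: picked up by Hadoop daemons and clients on this host
+export HADOOP_OPTIONAL_TOOLS="hadoop-azure"
+```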
+
+See also:
+* [FNS (non-HNS)](./fns_blob.html)
+* [WASB (legacy, deprecated)](./wasb.html)
+* [Testing](./testing_azure.html)
+
+## <a name="features"></a> Features of the ABFS connector
+
+* Supports reading and writing data stored in an Azure Blob Storage account.
+* *Fully Consistent* view of the storage across all clients.
+* Can read data written through the deprecated `wasb:` connector.
+* Presents a hierarchical file system view by implementing the standard Hadoop
   [`FileSystem`](../api/org/apache/hadoop/fs/FileSystem.html) interface.
 * Supports configuration of multiple Azure Blob Storage accounts.
-* Supports both block blobs (suitable for most use cases, such as MapReduce) 
and
-  page blobs (suitable for continuous write use cases, such as an HBase
-  write-ahead log).
-* Reference file system paths using URLs using the `wasb` scheme.
-* Also reference file system paths using URLs with the `wasbs` scheme for SSL
-  encrypted access.
-* Can act as a source of data in a MapReduce job, or a sink.
-* Tested on both Linux and Windows.
-* Tested at scale.
-
-## Limitations
-
-* File owner and group are persisted, but the permissions model is not 
enforced.
-  Authorization occurs at the level of the entire Azure Blob Storage account.
-* File last access time is not tracked.
+* Can act as a source or destination of data in Hadoop MapReduce, Apache Hive, Apache Spark.
+* Tested at scale on both Linux and Windows by Microsoft themselves.
+* Can be used as a replacement for HDFS on Hadoop clusters deployed in Azure infrastructure.
+
+For details on ABFS, consult the following documents:
+
+* [A closer look at Azure Data Lake Storage Gen2](https://azure.microsoft.com/en-gb/blog/a-closer-look-at-azure-data-lake-storage-gen2/);
+MSDN Article from June 28, 2018.
+* [Storage Tiers](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers)
 
-## Usage
+## Getting started
 
 ### Concepts
 
-The Azure Blob Storage data model presents 3 core concepts:
+The Azure Storage data model presents 3 core concepts:
 
 * **Storage Account**: All access is done through a storage account.
 * **Container**: A container is a grouping of multiple blobs.  A storage account
   may have multiple containers.  In Hadoop, an entire file system hierarchy is
-  stored in a single container.  It is also possible to configure multiple
-  containers, effectively presenting multiple file systems that can be referenced
-  using distinct URLs.
-* **Blob**: A file of any type and size.  In Hadoop, files are stored in blobs.
-  The internal implementation also uses blobs to persist the file system
-  hierarchy and other metadata.
+  stored in a single container.
+* **Blob**: A file of any type and size stored with the existing `wasb:` connector.
 
-### Configuring Credentials
+The ABFS connector connects to classic containers, or those created
+with Hierarchical Namespaces.
 
-Usage of Azure Blob Storage requires configuration of credentials.  Typically
-this is set in core-site.xml.  The configuration property name is of the form
-`fs.azure.account.key.<account name>.blob.core.windows.net` and the value is the
-access key.  **The access key is a secret that protects access to your storage
-account.  Do not share the access key (or the core-site.xml file) with an
-untrusted party.**
+## <a name="namespaces"></a> Hierarchical Namespaces (and WASB Compatibility)
 
-For example:
+A key aspect of ADLS Gen 2 is its support for
+[hierarchical namespaces](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace).
+These are effectively directories and offer high performance rename and delete operations,
+something which significantly improves the performance of query engines writing data
+to the store, including MapReduce, Spark, Hive, as well as DistCp.
 
-```xml
-<property>
-  <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
-  <value>YOUR ACCESS KEY</value>
-</property>
-```
-In many Hadoop clusters, the core-site.xml file is world-readable. It is possible to
-protect the access key within a credential provider as well. This provides an encrypted
-file format along with protection with file permissions.
+This feature is only available if the container was created with "namespace"
+support.
 
-#### Protecting the Azure Credentials for WASB with Credential Providers
+You enable namespace support when creating a new Storage Account,
+by checking the "Hierarchical Namespace" option in the Portal UI, or, when
+creating through the command line, using the option `--hierarchical-namespace true`.
 
-To protect these credentials from prying eyes, it is recommended that you use
-the credential provider framework to securely store them and access them
-through configuration. The following describes its use for Azure credentials
-in WASB FileSystem.
+_You cannot enable Hierarchical Namespaces on an existing storage account_
 
-For additional reading on the credential provider API see:
-[Credential Provider API](../hadoop-project-dist/hadoop-common/CredentialProviderAPI.html).
+_**Containers in a storage account with Hierarchical Namespaces are
+not (currently) readable through the deprecated `wasb:` connector.**_
 
-##### End to End Steps for Distcp and WASB with Credential Providers
-
-###### provision
+Some of the `az storage` command line commands fail too, for example:
 
 ```bash
-% hadoop credential create fs.azure.account.key.youraccount.blob.core.windows.net -value 123
-    -provider localjceks://file/home/lmccay/wasb.jceks
+$ az storage container list --account-name abfswales1
+Blob API is not yet supported for hierarchical namespace accounts. ErrorCode: BlobApiNotYetSupportedForHierarchicalNamespaceAccounts
 ```
 
-###### configure core-site.xml or command line system property
+### <a name="creating"></a> Creating an Azure Storage Account
 
-```xml
-<property>
-  <name>hadoop.security.credential.provider.path</name>
-  <value>localjceks://file/home/lmccay/wasb.jceks</value>
-  <description>Path to interrogate for protected credentials.</description>
-</property>
-```
+The best documentation on getting started with Azure Datalake Gen2 with the
+abfs connector is [Using Azure Data Lake Storage Gen2 with Azure HDInsight clusters](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-use-hdi-cluster)
+
+It includes instructions to create it from [the Azure command line tool](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest),
+which can be installed on Windows, MacOS (via Homebrew) and Linux (apt or yum).
 
-###### distcp
+The [az 
storage](https://docs.microsoft.com/en-us/cli/azure/storage?view=azure-cli-latest)
 subcommand
+handles all storage commands, [`az storage account 
create`](https://docs.microsoft.com/en-us/cli/azure/storage/account?view=azure-cli-latest#az-storage-account-create)
+does the creation.
 
+Until the ADLS gen2 API support is finalized, you need to add an extension
+to the ADLS command.
 ```bash
-% hadoop distcp
-    [-D hadoop.security.credential.provider.path=localjceks://file/home/lmccay/wasb.jceks]
-    hdfs://hostname:9001/user/lmccay/007020615 wasb://yourcontai...@youraccount.blob.core.windows.net/testDir/
+az extension add --name storage-preview
 ```
 
-NOTE: You may optionally add the provider path property to the distcp command line instead of
-added job specific configuration to a generic core-site.xml. The square brackets above illustrate
-this capability.
+Check that all is well by verifying that the usage command includes `--hierarchical-namespace`:
+```
+$  az storage account
+usage: az storage account create [-h] [--verbose] [--debug]
+     [--output {json,jsonc,table,tsv,yaml,none}]
+     [--query JMESPATH] --resource-group
+     RESOURCE_GROUP_NAME --name ACCOUNT_NAME
+     [--sku {Standard_LRS,Standard_GRS,Standard_RAGRS,Standard_ZRS,Premium_LRS,Premium_ZRS}]
+     [--location LOCATION]
+     [--kind {Storage,StorageV2,BlobStorage,FileStorage,BlockBlobStorage}]
+     [--tags [TAGS [TAGS ...]]]
+     [--custom-domain CUSTOM_DOMAIN]
+     [--encryption-services {blob,file,table,queue} [{blob,file,table,queue} ...]]
+     [--access-tier {Hot,Cool}]
+     [--https-only [{true,false}]]
+     [--file-aad [{true,false}]]
+     [--hierarchical-namespace [{true,false}]]
+     [--bypass {None,Logging,Metrics,AzureServices} [{None,Logging,Metrics,AzureServices} ...]]
+     [--default-action {Allow,Deny}]
+     [--assign-identity]
+     [--subscription _SUBSCRIPTION]
+```
 
-#### Protecting the Azure Credentials for WASB within an Encrypted File
+You can list locations from `az account list-locations`, which lists the
+name to refer to in the `--location` argument:
+```
+$ az account list-locations -o table
+
+DisplayName          Latitude    Longitude    Name
+-------------------  ----------  -----------  ------------------
+East Asia            22.267      114.188      eastasia
+Southeast Asia       1.283       103.833      southeastasia
+Central US           41.5908     -93.6208     centralus
+East US              37.3719     -79.8164     eastus
+East US 2            36.6681     -78.3889     eastus2
+West US              37.783      -122.417     westus
+North Central US     41.8819     -87.6278     northcentralus
+South Central US     29.4167     -98.5        southcentralus
+North Europe         53.3478     -6.2597      northeurope
+West Europe          52.3667     4.9          westeurope
+Japan West           34.6939     135.5022     japanwest
+Japan East           35.68       139.77       japaneast
+Brazil South         -23.55      -46.633      brazilsouth
+Australia East       -33.86      151.2094     australiaeast
+Australia Southeast  -37.8136    144.9631     australiasoutheast
+South India          12.9822     80.1636      southindia
+Central India        18.5822     73.9197      centralindia
+West India           19.088      72.868       westindia
+Canada Central       43.653      -79.383      canadacentral
+Canada East          46.817      -71.217      canadaeast
+UK South             50.941      -0.799       uksouth
+UK West              53.427      -3.084       ukwest
+West Central US      40.890      -110.234     westcentralus
+West US 2            47.233      -119.852     westus2
+Korea Central        37.5665     126.9780     koreacentral
+Korea South          35.1796     129.0756     koreasouth
+France Central       46.3772     2.3730       francecentral
+France South         43.8345     2.1972       francesouth
+Australia Central    -35.3075    149.1244     australiacentral
+Australia Central 2  -35.3075    149.1244     australiacentral2
+```
 
-In addition to using the credential provider framework to protect your credentials, it's
-also possible to configure it in encrypted form.  An additional configuration property
-specifies an external program to be invoked by Hadoop processes to decrypt the
-key.  The encrypted key value is passed to this external program as a command
-line argument:
+Once a location has been chosen, create the account
+```bash
 
-```xml
-<property>
-  <name>fs.azure.account.keyprovider.youraccount</name>
-  <value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value>
-</property>
+az storage account create --verbose \
+    --name abfswales1 \
+    --resource-group devteam2 \
+    --kind StorageV2 \
+    --hierarchical-namespace true \
+    --location ukwest \
+    --sku Standard_LRS \
+    --https-only true \
+    --encryption-services blob \
+    --access-tier Hot \
+    --tags owner=engineering \
+    --assign-identity \
+    --output jsonc
+```
 
-<property>
-  <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
-  <value>YOUR ENCRYPTED ACCESS KEY</value>
-</property>
+The output of the command is a JSON file, whose `primaryEndpoints` field
+includes the name of the store endpoint:
+```json
+{
+  "primaryEndpoints": {
+    "blob": "https://abfswales1.blob.core.windows.net/",
+    "dfs": "https://abfswales1.dfs.core.windows.net/",
+    "file": "https://abfswales1.file.core.windows.net/",
+    "queue": "https://abfswales1.queue.core.windows.net/",
+    "table": "https://abfswales1.table.core.windows.net/",
+    "web": "https://abfswales1.z35.web.core.windows.net/"
+  }
+}
+```
+
+The `abfswales1.dfs.core.windows.net` account is the name by which the
+storage account will be referred to.
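+
+Paths on this store are then referenced with URLs of the form
+`abfs://<container>@abfswales1.dfs.core.windows.net/<path>` (or `abfss://` for TLS).
+As a quick sanity check, assuming a container named `container1` has been created and
+the account key configured as described below:
+
+```bash
+# list the root of the (hypothetical) container1 filesystem
+hadoop fs -ls abfs://container1@abfswales1.dfs.core.windows.net/
+```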
+
+Now ask for the connection string to the store, which contains the account key
+```bash
+az storage account show-connection-string --name abfswales1
+{
+  "connectionString": "DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net;AccountName=abfswales1;AccountKey=ZGlkIHlvdSByZWFsbHkgdGhpbmsgSSB3YXMgZ29pbmcgdG8gcHV0IGEga2V5IGluIGhlcmU/IA=="
+}
+```
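+
+If you only need the key itself, a sketch using the `az storage account keys list`
+subcommand (reusing the `devteam2` resource group from the creation step above):
+
+```bash
+# print the first of the two account keys as plain text
+az storage account keys list \
+    --account-name abfswales1 \
+    --resource-group devteam2 \
+    --query '[0].value' -o tsv
+```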
 
+You then need to add the access key to your `core-site.xml`, JCEKS file, or
+use your cluster management tool to set the option `fs.azure.account.key.STORAGE-ACCOUNT`
+to this value.
+```XML
 <property>
-  <name>fs.azure.shellkeyprovider.script</name>
-  <value>PATH TO DECRYPTION PROGRAM</value>
+  <name>fs.azure.account.key.abfswales1.dfs.core.windows.net</name>
+  <value>ZGlkIHlvdSByZWFsbHkgdGhpbmsgSSB3YXMgZ29pbmcgdG8gcHV0IGEga2V5IGluIGhlcmU/IA==</value>
 </property>
-
 ```
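+
+To keep the key out of a world-readable `core-site.xml`, the JCEKS file mentioned above
+can be created with the Hadoop credential provider tooling; a minimal sketch (the
+`localjceks://` path is illustrative):
+
+```bash
+# store the account key in a local JCEKS credential store
+hadoop credential create fs.azure.account.key.abfswales1.dfs.core.windows.net \
+    -value 'ZGlkIHlvdSByZWFsbHkgdGhpbmsgSSB3YXMgZ29pbmcgdG8gcHV0IGEga2V5IGluIGhlcmU/IA==' \
+    -provider localjceks://file/home/user/abfs.jceks
+```
+
+and then point `hadoop.security.credential.provider.path` at that file in `core-site.xml`.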
 
-### Block Blob with Compaction Support and Configuration
+#### Creation through the Azure Portal
 
-Block blobs are the default kind of blob and are good for most big-data use
-cases. However, block blobs have strict limit of 50,000 blocks per blob.
-To prevent reaching the limit WASB, by default, does not upload new block to
-the service after every `hflush()` or `hsync()`.
+Creation through the portal is covered in [Quickstart: Create an Azure Data Lake Storage Gen2 storage account](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-quickstart-create-account)
 
-For most of the cases, combining data from multiple `write()` calls in
-blocks of 4Mb is a good optimization. But, in others cases, like HBase log files,
-every call to `hflush()` or `hsync()` must upload the data to the service.
+Key Steps
 
-Block blobs with compaction upload the data to the cloud service after every
-`hflush()`/`hsync()`. To mitigate the limit of 50000 blocks, `hflush()
-`/`hsync()` runs once compaction process, if number of blocks in the blob
-is above 32,000.
+1. Create a new Storage Account in a location which suits you.
+1. "Basics" Tab: select "StorageV2".

Review Comment:
   Corrected



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

