manika137 commented on code in PR #7540:
URL: https://github.com/apache/hadoop/pull/7540#discussion_r2028056018


##########
hadoop-tools/hadoop-azure/src/site/markdown/index.md:
##########
@@ -12,553 +12,1479 @@
   limitations under the License. See accompanying LICENSE file.
 -->
 
-# Hadoop Azure Support: Azure Blob Storage
+# Hadoop Azure Support: ABFS - Azure Data Lake Storage Gen2
 
 <!-- MACRO{toc|fromDepth=1|toDepth=3} -->
 
-See also:
-
-* [WASB](./wasb.html)
-* [ABFS](./abfs.html)
-* [Namespace Disabled Accounts on ABFS](./fns_blob.html)
-* [Testing](./testing_azure.html)
-
-## Introduction
+## <a name="introduction"></a> Introduction
 
-The `hadoop-azure` module provides support for integration with
-[Azure Blob Storage](http://azure.microsoft.com/en-us/documentation/services/storage/).
-The built jar file, named `hadoop-azure.jar`, also declares transitive dependencies
-on the additional artifacts it requires, notably the
-[Azure Storage SDK for Java](https://github.com/Azure/azure-storage-java).
+The `hadoop-azure` module provides support for the Azure Data Lake Storage Gen2
+storage layer through the "abfs" connector.
 
-To make it part of Apache Hadoop's default classpath, simply make sure that
-`HADOOP_OPTIONAL_TOOLS`in `hadoop-env.sh` has `'hadoop-azure` in the list.
-Example:
+To make it part of Apache Hadoop's default classpath, make sure that
+`HADOOP_OPTIONAL_TOOLS` environment variable has `hadoop-azure` in the list,
+*on every machine in the cluster*.
 
 ```bash
-export HADOOP_OPTIONAL_TOOLS="hadoop-azure,hadoop-azure-datalake"
+export HADOOP_OPTIONAL_TOOLS=hadoop-azure
 ```
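+
+As a quick sanity check (a sketch; the exact jar name varies with the Hadoop
+release), confirm the module is now on the classpath:
+
+```bash
+# expect a hadoop-azure-<version>.jar entry in the output
+hadoop classpath | tr ':' '\n' | grep hadoop-azure
+```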
-## Features
 
-* Read and write data stored in an Azure Blob Storage account.
-* Present a hierarchical file system view by implementing the standard Hadoop
+You can set this locally in your `.profile`/`.bashrc`, but note it won't
+propagate to jobs running in-cluster.
+
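+For cluster-wide pickup, one option (a sketch; the file's location depends on
+your installation) is to set the variable in `hadoop-env.sh` instead:
+
+```bash
+# etc/hadoop/hadoop-env.sh (typical, but installation-dependent, location)
+export HADOOP_OPTIONAL_TOOLS=hadoop-azure
+```
+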
+See also:
+* [FNS (non-HNS) accounts](./fns_blob.html)
+* [WASB (legacy, deprecated)](./wasb.html)
+* [Testing](./testing_azure.html)
+
+## <a name="features"></a> Features of the ABFS connector.
+
+* Supports reading and writing data stored in an Azure Blob Storage account.
+* *Fully Consistent* view of the storage across all clients.
+* Can read data written through the deprecated `wasb:` connector.
+* Presents a hierarchical file system view by implementing the standard Hadoop
   [`FileSystem`](../api/org/apache/hadoop/fs/FileSystem.html) interface.
 * Supports configuration of multiple Azure Blob Storage accounts.
-* Supports both block blobs (suitable for most use cases, such as MapReduce) and
-  page blobs (suitable for continuous write use cases, such as an HBase
-  write-ahead log).
-* Reference file system paths using URLs using the `wasb` scheme.
-* Also reference file system paths using URLs with the `wasbs` scheme for SSL
-  encrypted access.
-* Can act as a source of data in a MapReduce job, or a sink.
-* Tested on both Linux and Windows.
-* Tested at scale.
-
-## Limitations
-
-* File owner and group are persisted, but the permissions model is not enforced.
-  Authorization occurs at the level of the entire Azure Blob Storage account.
-* File last access time is not tracked.
+* Can act as a source or destination of data in Hadoop MapReduce, Apache Hive,
+  and Apache Spark (see the path example after this list).
+* Tested at scale on both Linux and Windows by Microsoft themselves.
+* Can be used as a replacement for HDFS on Hadoop clusters deployed in
+  Azure infrastructure.
+
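+For instance (placeholder account and container names; authentication must
+already be configured), listing the root of a container:
+
+```bash
+# abfss:// is the TLS-secured variant of the scheme
+hadoop fs -ls abfs://mycontainer@myaccount.dfs.core.windows.net/
+```
+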
+For details on ABFS, consult the following documents:
+
+* [A closer look at Azure Data Lake Storage Gen2](https://azure.microsoft.com/en-gb/blog/a-closer-look-at-azure-data-lake-storage-gen2/);
+MSDN Article from June 28, 2018.
+* [Storage Tiers](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers)
 
-## Usage
+## <a name="getting_started"></a> Getting started
 
 ### Concepts
 
-The Azure Blob Storage data model presents 3 core concepts:
+The Azure Storage data model presents 3 core concepts:
 
 * **Storage Account**: All access is done through a storage account.
* **Container**: A container is a grouping of multiple blobs.  A storage account
   may have multiple containers.  In Hadoop, an entire file system hierarchy is
-  stored in a single container.  It is also possible to configure multiple
-  containers, effectively presenting multiple file systems that can be referenced
-  using distinct URLs.
-* **Blob**: A file of any type and size.  In Hadoop, files are stored in blobs.
-  The internal implementation also uses blobs to persist the file system
-  hierarchy and other metadata.
+  stored in a single container.
+* **Blob**: A file of any type and size, stored in the same form used by the
+  existing `wasb` connector (see the URI sketch below).
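+
+Tying the three concepts together: an ABFS path names the storage account, the
+container, and the file within it. A sketch with illustrative names, assuming
+credentials are already configured:
+
+```bash
+# abfs://<container>@<storage account>.dfs.core.windows.net/<path>
+hadoop fs -cat abfs://mycontainer@myaccount.dfs.core.windows.net/dir/file.txt
+```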

Review Comment:
   Corrected



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

