Esfandiar Manii created HADOOP-15407:
----------------------------------------
Summary: Support Windows Azure Storage - Blob file system in Hadoop
Key: HADOOP-15407
URL: https://issues.apache.org/jira/browse/HADOOP-15407
Project: Hadoop Common
Issue Type: New Feature
Components: fs/azure
Affects Versions: 3.2.0
Reporter: Esfandiar Manii
Assignee: Esfandiar Manii
{color:#212121}{color:#333333}Description{color}{color}
{color:#212121}This JIRA adds a new file system implementation, ABFS, for
running Big Data and Analytics workloads against Azure Storage. This is a
complete rewrite of the previous WASB driver with a heavy focus on optimizing
both performance and cost.{color}
{color:#212121} {color}
{color:#212121}{color:#333333}High level design{color}{color}
{color:#212121}At a high level, the code here extends the FileSystem class to
provide an implementation for accessing blobs in Azure Storage. The scheme abfs
is used for access over HTTP, and abfss for access over HTTPS. The
following URI format is used to address individual paths:{color}
{color:#212121} {color}
{color:#212121}abfs[s]://<filesystem>@<account>.dfs.core.windows.net/<path>{color}
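As a rough illustration of how the components of such a URI map onto the format above, here is a minimal sketch that uses only java.net.URI. It is not part of the ABFS driver: the class AbfsUriExample, the holder AbfsLocation, and the sample account and file system names are all hypothetical, and the driver's own parsing may differ.

```java
import java.net.URI;

/** Illustrative only; not taken from the ABFS driver. */
public class AbfsUriExample {

    /** Hypothetical holder for the parts of an abfs[s] path. */
    static final class AbfsLocation {
        final boolean secure;      // true for abfss (HTTPS), false for abfs (HTTP)
        final String fileSystem;   // the <filesystem> part
        final String accountHost;  // <account>.dfs.core.windows.net
        final String path;         // the <path> part, including the leading slash

        AbfsLocation(boolean secure, String fileSystem, String accountHost, String path) {
            this.secure = secure;
            this.fileSystem = fileSystem;
            this.accountHost = accountHost;
            this.path = path;
        }
    }

    /** Splits abfs[s]://<filesystem>@<account>.dfs.core.windows.net/<path> into its parts. */
    static AbfsLocation parse(String text) {
        URI uri = URI.create(text);
        String scheme = uri.getScheme();
        if (!"abfs".equals(scheme) && !"abfss".equals(scheme)) {
            throw new IllegalArgumentException("Not an abfs[s] URI: " + text);
        }
        // java.net.URI exposes the text before '@' as the user-info component.
        String fileSystem = uri.getUserInfo();
        String host = uri.getHost();
        if (fileSystem == null || host == null) {
            throw new IllegalArgumentException("Expected <filesystem>@<account host>: " + text);
        }
        return new AbfsLocation("abfss".equals(scheme), fileSystem, host, uri.getPath());
    }

    public static void main(String[] args) {
        // Sample values are made up for the example.
        AbfsLocation loc = parse("abfss://data@myaccount.dfs.core.windows.net/logs/app.log");
        System.out.println("secure=" + loc.secure);
        System.out.println("fileSystem=" + loc.fileSystem);
        System.out.println("accountHost=" + loc.accountHost);
        System.out.println("path=" + loc.path);
    }
}
```

The file system name travels in the user-info position of the authority, so a generic URI parser is enough to separate it from the account host.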
{color:#212121} {color}
{color:#212121}ABFS is intended as a replacement for WASB. WASB is not
deprecated but is in pure maintenance mode, and customers should upgrade to ABFS
once it hits General Availability later in CY18.{color}
{color:#212121}Benefits of ABFS include:{color}
{color:#212121}· Supporting higher-scale (capacity, throughput, and IOPS) Big Data
and Analytics workloads by allowing higher limits on storage accounts{color}
{color:#212121}· Removing any ramp up time with Storage backend
partitioning; blocks are now automatically sharded across partitions in the
Storage backend{color}
{color:#212121}o This avoids the need for temporary/intermediate
files, which increase cost (and framework complexity around committing
jobs/tasks){color}
{color:#212121}· Enabling much higher read and write throughput on
single files (tens of Gbps by default){color}
{color:#212121}· Still retaining all of the Azure Blob features
customers are familiar with and expect, and gaining the benefits of future Blob
features as well{color}
{color:#212121}ABFS incorporates Hadoop Filesystem metrics to monitor the file
system throughput and operations. Ambari metrics are not currently implemented
for ABFS, but will be available soon.{color}
{color:#212121} {color}
{color:#212121}{color:#333333}Credits and history{color}{color}
{color:#212121}Credit for this work goes to (hope I don't forget anyone): Shane
Mainali, {color}{color:#212121}Thomas Marquardt, Zichen Sun, Georgi Chalakov,
Esfandiar Manii, Amit Singh, Dana Kaban, Da Zhou, Junhua Gu, Saher Ahwal,
Saurabh Pant, and James Baker. {color}
{color:#212121}{color:#333333} {color}{color}
{color:#212121}{color:#333333}Test{color}{color}
{color:#212121}ABFS has gone through many test procedures including Hadoop file
system contract tests, unit testing, functional testing, and manual testing.
All the JUnit tests provided with the driver can be run either sequentially or
in parallel, which reduces the testing time.{color}
{color:#212121}Besides unit tests, we have used ABFS as the default file system
in Azure HDInsight. Azure HDInsight will very soon offer ABFS as a storage
option. (HDFS is also used, but not as the default file system.) Various
customer and test workloads have been run against clusters with such
configurations for quite some time. Benchmarks such as Tera*, TPC-DS, Spark
Streaming and Spark SQL, and others have been run to do scenario, performance,
and functional testing. Third parties and customers have also done various
testing of ABFS.{color}
{color:#212121}The current version reflects the version of the code tested
and used in our production environment.{color}