[jira] [Updated] (HADOOP-15407) Support Windows Azure Storage - Blob file system in Hadoop

Esfandiar Manii (JIRA) Mon, 23 Apr 2018 15:18:23 -0700

     [ 
https://issues.apache.org/jira/browse/HADOOP-15407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Esfandiar Manii updated HADOOP-15407:
-------------------------------------
    Description: 
{color:#333333}Description{color}
This JIRA adds a new file system implementation, ABFS, for running Big Data and 
Analytics workloads against Azure Storage. This is a complete rewrite of the 
previous WASB driver with a heavy focus on optimizing both performance and cost.
{color:#333333}High level design{color}
At a high level, the code here extends the FileSystem class to provide an 
implementation for accessing blobs in Azure Storage. The scheme abfs is used 
for accessing it over HTTP, and abfss for accessing over HTTPS. The following 
URI scheme is used to address individual paths:
abfs[s]://<filesystem>@<account>.dfs.core.windows.net/<path>
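
As a rough illustration of how the URI layout above breaks down, the following sketch splits an ABFS URI into its parts with Python's standard library. It is not code from the ABFS driver: the names parse_abfs_uri and AbfsLocation are hypothetical, and the account, file system, and path in the usage line are made-up values.

```python
from typing import NamedTuple
from urllib.parse import urlsplit

# Host suffix taken from the URI layout in the description.
ENDPOINT_SUFFIX = ".dfs.core.windows.net"


class AbfsLocation(NamedTuple):
    """Hypothetical container for the pieces of an ABFS URI."""
    secure: bool      # True for abfss (HTTPS), False for abfs (HTTP)
    filesystem: str   # the <filesystem> part before the '@'
    account: str      # the <account> part of the host name
    path: str         # the <path> part, "/" when absent


def parse_abfs_uri(uri: str) -> AbfsLocation:
    """Split abfs[s]://<filesystem>@<account>.dfs.core.windows.net/<path>."""
    parts = urlsplit(uri)
    if parts.scheme not in ("abfs", "abfss"):
        raise ValueError(f"not an ABFS URI: {uri!r}")
    host = parts.hostname or ""
    if not parts.username or not host.endswith(ENDPOINT_SUFFIX):
        raise ValueError(f"malformed ABFS authority: {uri!r}")
    account = host[: -len(ENDPOINT_SUFFIX)]
    if not account:
        raise ValueError(f"missing storage account: {uri!r}")
    return AbfsLocation(
        secure=(parts.scheme == "abfss"),
        filesystem=parts.username,
        account=account,
        path=parts.path or "/",
    )


# Example with made-up names:
loc = parse_abfs_uri("abfss://data@myaccount.dfs.core.windows.net/logs/part-0000")
print(loc)
```

The file system name travels in the user-info position of the authority, which is why a generic URI parser is enough to recover it; the driver itself may validate these components differently.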
{color:#333333} {color}
ABFS is intended as a replacement for WASB. WASB is not deprecated but is in 
pure maintenance mode, and customers should upgrade to ABFS once it hits General 
Availability later in CY18.
Benefits of ABFS include:
 * Higher scale (capacity, throughput, and IOPS) for Big Data and Analytics 
workloads by allowing higher limits on storage accounts
 * Removing any ramp up time with Storage backend partitioning; blocks are now 
automatically sharded across partitions in the Storage backend
 ** This avoids the need for temporary/intermediate files, which increase 
cost (and framework complexity around committing jobs/tasks)

 * Enabling much higher read and write throughput on single files (tens of Gbps 
by default)
 * Still retaining all of the Azure Blob features customers are familiar with 
and expect, and gaining the benefits of future Blob features as well

ABFS incorporates Hadoop Filesystem metrics to monitor the file system 
throughput and operations. Ambari metrics are not currently implemented for 
ABFS, but will be available soon.
 
{color:#333333}Credits and history{color}
Credit for this work goes to <add all of our Big Data team>.
{color:#333333}Test{color}
ABFS has gone through many test procedures, including Hadoop file system 
contract tests, unit testing, functional testing, and manual testing. All the 
JUnit tests provided with the driver can run either sequentially or in 
parallel, which reduces testing time.
Besides unit tests, we have used ABFS as the default file system in Azure 
HDInsight. Azure HDInsight will very soon offer ABFS as a storage option. (HDFS 
is also used, but not as the default file system.) Various customer and 
test workloads have been run against clusters with such configurations for 
quite some time. Benchmarks such as Tera*, TPC-DS, Spark Streaming and Spark 
SQL, and others have been run to do scenario, performance, and functional 
testing. Third parties and customers have also done various testing of ABFS.
The current version reflects the version of the code tested and used in our 
production environment.

  was:
{color:#212121}{color:#333333}Description{color}{color}
{color:#212121}This JIRA adds a new file system implementation, ABFS, for 
running Big Data and Analytics workloads against Azure Storage. This is a 
complete rewrite of the previous WASB driver with a heavy focus on optimizing 
both performance and cost.{color}
{color:#212121} {color}
{color:#212121}{color:#333333}High level design{color}{color}
{color:#212121}At a high level, the code here extends the FileSystem class to 
provide an implementation for accessing blobs in Azure Storage. The scheme abfs 
is used for accessing it over HTTP, and abfss for accessing over HTTPS. The 
following URI scheme is used to address individual paths:{color}
{color:#212121} {color}
{color:#212121}abfs[s]://<filesystem>@<account>.dfs.core.windows.net/<path>{color}
{color:#212121} {color}
{color:#212121}ABFS is intended as a replacement to WASB. WASB is not 
deprecated but is in pure maintenance mode and customers should upgrade to ABFS 
once it hits General Availability later in CY18.{color}
{color:#212121}Benefits of ABFS include:{color}
{color:#212121}·         Higher scale (capacity, throughput, and IOPS) Big Data 
and Analytics workloads by allowing higher limits on storage accounts{color}
{color:#212121}·         Removing any ramp up time with Storage backend 
partitioning; blocks are now automatically sharded across partitions in the 
Storage backend{color}
{color:#212121}o    This avoids the need for using temporary/intermediate 
files, increasing the cost (and framework complexity around committing 
jobs/tasks){color}
{color:#212121}·         Enabling much higher read and write throughput on 
single files (tens of Gbps by default){color}
{color:#212121}·         Still retaining all of the Azure Blob features 
customers are familiar with and expect, and gaining the benefits of future Blob 
features as well{color}
{color:#212121}ABFS incorporates Hadoop Filesystem metrics to monitor the file 
system throughput and operations. Ambari metrics are not currently implemented 
for ABFS, but will be available soon.{color}
{color:#212121} {color}
{color:#212121}{color:#333333}Credits and history{color}{color}
{color:#212121}Credit for this work goes to (hope I don't forget anyone): Shane 
Mainali, {color}{color:#212121}Thomas Marquardt, Zichen Sun, Georgi Chalakov, 
Esfandiar Manii, Amit Singh, Dana Kaban, Da Zhou, Junhua Gu, Saher Ahwal, 
Saurabh Pant, and James Baker. {color}
{color:#212121}{color:#333333} {color}{color}
{color:#212121}{color:#333333}Test{color}{color}
{color:#212121}ABFS has gone through many test procedures including Hadoop file 
system contract tests, unit testing, functional testing, and manual testing. 
All the Junit tests provided with the driver are capable of running in both 
sequential/parallel fashion in order to reduce the testing time.{color}
{color:#212121}Besides unit tests, we have used ABFS as the default file system 
in Azure HDInsight. Azure HDInsight will very soon offer ABFS as a storage 
option. (HDFS is also used but not as default file system.) Various different 
customer and test workloads have been run against clusters with such 
configurations for quite some time. Benchmarks such as Tera*, TPC-DS, Spark 
Streaming and Spark SQL, and others have been run to do scenario, performance, 
and functional testing. Third parties and customers have also done various 
testing of ABFS.{color}
{color:#212121}The current version reflects to the version of the code tested 
and used in our production environment.{color}


> Support Windows Azure Storage - Blob file system in Hadoop
> ----------------------------------------------------------
>
>                 Key: HADOOP-15407
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15407
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs/azure
>    Affects Versions: 3.2.0
>            Reporter: Esfandiar Manii
>            Assignee: Esfandiar Manii
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

