[
https://issues.apache.org/jira/browse/HADOOP-16259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Loughran updated HADOOP-16259:
------------------------------------
Docs Text:
ENHANCE HADOOP DISTCP FOR CUSTOM S3 STORAGE CLASS
Problem statement:
The Hadoop distcp implementation has no property to override the storage class
when transferring data to Amazon S3; it does not set any storage class at all.
As a result, every object moved from the cluster to S3 using Hadoop distcp is
stored in the default storage class “STANDARD”.
Because of this limitation, clusters that rely heavily on distcp to transfer
data to S3 are forced to PUT objects into the high-cost “STANDARD” storage
class and then use S3 lifecycle policies to transition the data to a
cost-effective archive tier such as “GLACIER”.
This contributes to a considerable increase in billing, since the data is
staged in the “STANDARD” tier before transitioning to “GLACIER” even for use
cases where archival is the only business need.
This problem can be fixed by implementing the changes below in
hadoop-aws-x.x.x.jar.
Design:
The hadoop-aws jar is part of the Hadoop distribution and provides the s3, s3n
and s3a protocols for accessing objects stored in S3. To enable the storage
class override for all three protocols, the changes described below have to be
implemented for each protocol.
Note: depending on the Hadoop version of the cluster, the matching source
release of hadoop-aws-x.x.x.jar has to be obtained from the Apache download
site.
We will introduce a storage class property “fs.s3.storage.class”. It will
default to “STANDARD”, but can be overridden to any of the valid S3 storage
classes (STANDARD | REDUCED_REDUNDANCY | GLACIER | STANDARD_IA | ONEZONE_IA |
INTELLIGENT_TIERING | DEEP_ARCHIVE).
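For example, once the connector honours the property, an illustrative distcp
invocation (the paths here are placeholders, not from this proposal) could
write the copied objects straight into GLACIER:
    hadoop distcp -Dfs.s3.storage.class=GLACIER hdfs:///data/archive s3a://my-bucket/archive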
S3A:
1. In the class “Constants” under the package “org.apache.hadoop.fs.s3a”,
define the storage class property key, its default value and the corresponding
S3 header name as shown below:
a. public static final String S3_STORAGE_CLASS = "fs.s3.storage.class";
b. public static final String S3_STORAGE_CLASS_DEFAULT = "STANDARD";
c. public static final String S3_STORAGE_CLASS_HEADER = "x-amz-storage-class";
2. In the class “S3AOutputStream” under the package “org.apache.hadoop.fs.s3a”,
introduce an “s3StorageClass” field and initialize it in the constructor as
below:
a. this.s3StorageClass = conf.get(S3_STORAGE_CLASS, S3_STORAGE_CLASS_DEFAULT);
- This reads the Hadoop configuration and checks whether an override value has
been provided for the property “fs.s3.storage.class”. If so, that value is used
while uploading the object to S3; otherwise the default value “STANDARD” is
used.
b. Then, in the close() method, set the storage class on the object metadata as
follows:
i. om.setHeader(S3_STORAGE_CLASS_HEADER, this.s3StorageClass);
3. In the class “S3AFastOutputStream” under the package
“org.apache.hadoop.fs.s3a”, introduce an “s3StorageClass” field and initialize
it in the constructor as below:
a. this.s3StorageClass = fs.getConf().get(S3_STORAGE_CLASS,
S3_STORAGE_CLASS_DEFAULT); - This reads the Hadoop configuration and checks
whether an override value has been provided for the property
“fs.s3.storage.class”. If so, that value is used while uploading the object to
S3; otherwise the default value “STANDARD” is used.
b. Then, in the close() method, set the storage class on the object metadata as
follows (see the sketch after this list):
i. om.setHeader(S3_STORAGE_CLASS_HEADER, this.s3StorageClass);
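A minimal Java sketch of the intended behaviour, assuming the constants from
step 1; the helper class and method names below are placeholders for
illustration only, while the real change would live inside S3AOutputStream and
S3AFastOutputStream as described in steps 2 and 3 above:
    import org.apache.hadoop.conf.Configuration;
    import com.amazonaws.services.s3.model.ObjectMetadata;

    // Illustrative helper only; not part of the actual patch.
    final class StorageClassExample {
      static final String S3_STORAGE_CLASS = "fs.s3.storage.class";
      static final String S3_STORAGE_CLASS_DEFAULT = "STANDARD";
      static final String S3_STORAGE_CLASS_HEADER = "x-amz-storage-class";

      // Resolve the storage class: the configured override if present, else STANDARD.
      static String resolveStorageClass(Configuration conf) {
        return conf.get(S3_STORAGE_CLASS, S3_STORAGE_CLASS_DEFAULT);
      }

      // Tag the upload metadata so the PUT request carries the x-amz-storage-class header.
      static void applyStorageClass(ObjectMetadata om, Configuration conf) {
        om.setHeader(S3_STORAGE_CLASS_HEADER, resolveStorageClass(conf));
      }
    }
If the AWS SDK request object is accessible at the point of upload,
PutObjectRequest.setStorageClass(...) would be an alternative to setting the
raw header; the header approach above follows the proposal as written.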
Advantage:
• Objects can be uploaded directly into the GLACIER storage class, which
considerably reduces billing by eliminating the unneeded staging of data in the
STANDARD tier.
was:
ENHANCE HADOOP DISTCP FOR CUSTOM S3 STORAGE CLASS
Problem statement:
The Hadoop distcp implementation has no property to override the storage class
when transferring data to Amazon S3; it does not set any storage class at all.
As a result, every object moved from the cluster to S3 using Hadoop distcp is
stored in the default storage class “STANDARD”.
Because of this limitation, clusters that rely heavily on distcp to transfer
data to S3 are forced to PUT objects into the high-cost “STANDARD” storage
class and then use S3 lifecycle policies to transition the data to a
cost-effective archive tier such as “GLACIER”.
This contributes to a considerable increase in billing, since the data is
staged in the “STANDARD” tier before transitioning to “GLACIER” even for use
cases where archival is the only business need.
This problem can be fixed by implementing the changes below in
hadoop-aws-x.x.x.jar.
Design:
The hadoop-aws jar is part of the Hadoop distribution and provides the s3, s3n
and s3a protocols for accessing objects stored in S3. To enable the storage
class override for all three protocols, the changes described below have to be
implemented for each protocol.
Note: depending on the Hadoop version of the cluster, the matching source
release of hadoop-aws-x.x.x.jar has to be obtained from the Apache download
site.
We will introduce a storage class property “fs.s3.storage.class”. It will
default to “STANDARD”, but can be overridden to any of the valid S3 storage
classes (STANDARD | REDUCED_REDUNDANCY | GLACIER | STANDARD_IA | ONEZONE_IA |
INTELLIGENT_TIERING | DEEP_ARCHIVE).
S3:
1. In the class “S3FileSystemConfigKeys” under the package
“org.apache.hadoop.fs.s3”, define the storage class property key and its
default value as shown below:
a. public static final String S3_STORAGE_CLASS = "fs.s3.storage.class";
b. public static final String S3_STORAGE_CLASS_DEFAULT = "STANDARD";
2. In the class “Jets3tFileSystemStore” under the package
“org.apache.hadoop.fs.s3”, introduce an “s3StorageClass” field and initialize
it in the constructor as below:
a. this.s3StorageClass = conf.get(S3FileSystemConfigKeys.S3_STORAGE_CLASS,
S3FileSystemConfigKeys.S3_STORAGE_CLASS_DEFAULT); - This reads the Hadoop
configuration and checks whether an override value has been provided for the
property “fs.s3.storage.class”. If so, that value is used while uploading the
object to S3; otherwise the default value “STANDARD” is used.
b. Then, in the put() method, set the storage class on the S3 object as follows
(a JetS3t sketch follows the S3N section below):
i. object.setStorageClass(this.s3StorageClass);
S3N:
1. In the class “S3NativeFileSystemConfigKeys” under the package
“org.apache.hadoop.fs.s3native”, define the storage class property key and its
default value as shown below:
a. public static final String S3_STORAGE_CLASS = "fs.s3.storage.class";
b. public static final String S3_STORAGE_CLASS_DEFAULT = "STANDARD";
2. In the class “Jets3tNativeFileSystemStore” under the package
“org.apache.hadoop.fs.s3native”, introduce an “s3StorageClass” field and
initialize it in the constructor as below:
a. this.s3StorageClass = conf.get(S3NativeFileSystemConfigKeys.S3_STORAGE_CLASS,
S3NativeFileSystemConfigKeys.S3_STORAGE_CLASS_DEFAULT); - This reads the Hadoop
configuration and checks whether an override value has been provided for the
property “fs.s3.storage.class”. If so, that value is used while uploading the
object to S3; otherwise the default value “STANDARD” is used.
b. Then, in the storeFile, storeLargeFile and storeEmptyFile methods, set the
storage class on the S3 object as follows (see the sketch after this section):
i. object.setStorageClass(this.s3StorageClass);
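A minimal, illustrative Java sketch of the JetS3t-side change shared by the s3
and s3n stores; the helper class and method names are placeholders, and the
real change would go into Jets3tFileSystemStore#put() and the
Jets3tNativeFileSystemStore store methods listed above:
    import org.apache.hadoop.conf.Configuration;
    import org.jets3t.service.model.S3Object;

    // Illustrative helper only; not part of the actual patch.
    final class Jets3tStorageClassExample {
      // Apply the configured storage class (default STANDARD) to the object about to be uploaded.
      static S3Object withStorageClass(S3Object object, Configuration conf) {
        String s3StorageClass = conf.get("fs.s3.storage.class", "STANDARD");
        object.setStorageClass(s3StorageClass);
        return object;
      }
    }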
S3A:
1. In the class “Constants” under the package “org.apache.hadoop.fs.s3a”,
define the storage class property key, its default value and the corresponding
S3 header name as shown below:
a. public static final String S3_STORAGE_CLASS = "fs.s3.storage.class";
b. public static final String S3_STORAGE_CLASS_DEFAULT = "STANDARD";
c. public static final String S3_STORAGE_CLASS_HEADER = "x-amz-storage-class";
2. In the class “S3AOutputStream” under the package “org.apache.hadoop.fs.s3a”,
introduce an “s3StorageClass” field and initialize it in the constructor as
below:
a. this.s3StorageClass = conf.get(S3_STORAGE_CLASS, S3_STORAGE_CLASS_DEFAULT);
- This reads the Hadoop configuration and checks whether an override value has
been provided for the property “fs.s3.storage.class”. If so, that value is used
while uploading the object to S3; otherwise the default value “STANDARD” is
used.
b. Then, in the close() method, set the storage class on the object metadata as
follows:
i. om.setHeader(S3_STORAGE_CLASS_HEADER, this.s3StorageClass);
3. In the class “S3AFastOutputStream” under the package
“org.apache.hadoop.fs.s3a”, introduce an “s3StorageClass” field and initialize
it in the constructor as below:
a. this.s3StorageClass = fs.getConf().get(S3_STORAGE_CLASS,
S3_STORAGE_CLASS_DEFAULT); - This reads the Hadoop configuration and checks
whether an override value has been provided for the property
“fs.s3.storage.class”. If so, that value is used while uploading the object to
S3; otherwise the default value “STANDARD” is used.
b. Then, in the close() method, set the storage class on the object metadata as
follows:
i. om.setHeader(S3_STORAGE_CLASS_HEADER, this.s3StorageClass);
Advantage:
• Objects can be uploaded directly into the GLACIER storage class, which
considerably reduces billing by eliminating the unneeded staging of data in the
STANDARD tier.
> Distcp to set S3 Storage Class
> ------------------------------
>
> Key: HADOOP-16259
> URL: https://issues.apache.org/jira/browse/HADOOP-16259
> Project: Hadoop Common
> Issue Type: New Feature
> Components: hadoop-aws
> Affects Versions: 2.8.4
> Reporter: Prakash Gopalsamy
> Priority: Minor
> Labels: aws-s3, distcp
> Attachments: ENHANCE_HADOOP_DISTCP_FOR_CUSTOM_S3_STORAGE_CLASS.docx
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> The Hadoop distcp implementation has no property to override the storage
> class when transferring data to Amazon S3; it does not set any storage class
> at all, so every object moved from the cluster to S3 using distcp is stored
> in the default storage class “STANDARD”. A new feature to override the
> default S3 storage class through configuration properties would make it
> possible to upload objects into other storage classes. I have put together a
> design for this feature in a design document and attached it to this JIRA.
> Kindly review and let me know your suggestions.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)