[
https://issues.apache.org/jira/browse/HADOOP-16259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Loughran updated HADOOP-16259:
------------------------------------
Docs Text:
ENHANCE HADOOP DISTCP FOR CUSTOM S3 STORAGE CLASS
Problem statement:
The Hadoop distcp implementation has no property to override the storage class
when transferring data to Amazon S3; it does not set any storage class at all.
As a result, every object moved from the cluster to S3 using Hadoop distcp is
stored in the default storage class “STANDARD”.
Because of this limitation, clusters that rely heavily on distcp to transfer
data to S3 are forced to PUT objects into the high-cost “STANDARD” storage
class and then use S3 lifecycle policies to transition the data to a
cost-effective archive tier such as “GLACIER”.
This contributes to a considerable increase in billing, since the data is
staged in the “STANDARD” tier before transitioning to “GLACIER” even for use
cases where archival is the only business need.
This problem can be fixed by implementing the changes below in
hadoop-aws-x.x.x.jar.
Design:
The hadoop-aws jar is part of the Hadoop distribution and provides the s3, s3n
and s3a protocols for accessing objects stored in S3. To enable the storage
class override for all three protocols, the changes described below have to be
implemented for each protocol.
Note: depending on the Hadoop version of the cluster, the matching source
release of hadoop-aws-x.x.x.jar has to be obtained from the Apache download
site.
We will introduce a storage class property “fs.s3.storage.class”. It will
default to “STANDARD”, but can be overridden to any of the valid S3 storage
classes (STANDARD | REDUCED_REDUNDANCY | GLACIER | STANDARD_IA | ONEZONE_IA |
INTELLIGENT_TIERING | DEEP_ARCHIVE).
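For example, once the connector honours the property, an illustrative distcp
invocation (the paths here are placeholders, not from this proposal) could
write the copied objects straight into GLACIER:
    hadoop distcp -Dfs.s3.storage.class=GLACIER hdfs:///data/archive s3a://my-bucket/archive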
S3A:
1. In the class “Constants” under the package “org.apache.hadoop.fs.s3a”,
define the storage class property key, its default value and the corresponding
S3 header name as shown below:
a. public static final String S3_STORAGE_CLASS = "fs.s3.storage.class";
b. public static final String S3_STORAGE_CLASS_DEFAULT = "STANDARD";
c. public static final String S3_STORAGE_CLASS_HEADER = "x-amz-storage-class";
2. In the class “S3AOutputStream” under the package “org.apache.hadoop.fs.s3a”,
introduce an “s3StorageClass” field and initialize it in the constructor as
below:
a. this.s3StorageClass = conf.get(S3_STORAGE_CLASS, S3_STORAGE_CLASS_DEFAULT);
- This reads the Hadoop configuration and checks whether an override value has
been provided for the property “fs.s3.storage.class”. If so, that value is used
while uploading the object to S3; otherwise the default value “STANDARD” is
used.
b. Then, in the close() method, set the storage class on the object metadata as
follows:
i. om.setHeader(S3_STORAGE_CLASS_HEADER, this.s3StorageClass);
3. In the class “S3AFastOutputStream” under the package
“org.apache.hadoop.fs.s3a”, introduce an “s3StorageClass” field and initialize
it in the constructor as below:
a. this.s3StorageClass = fs.getConf().get(S3_STORAGE_CLASS,
S3_STORAGE_CLASS_DEFAULT); - This reads the Hadoop configuration and checks
whether an override value has been provided for the property
“fs.s3.storage.class”. If so, that value is used while uploading the object to
S3; otherwise the default value “STANDARD” is used.
b. Then, in the close() method, set the storage class on the object metadata as
follows (see the sketch after this list):
i. om.setHeader(S3_STORAGE_CLASS_HEADER, this.s3StorageClass);
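A minimal Java sketch of the intended behaviour, assuming the constants from
step 1; the helper class and method names below are placeholders for
illustration only, while the real change would live inside S3AOutputStream and
S3AFastOutputStream as described in steps 2 and 3 above:
    import org.apache.hadoop.conf.Configuration;
    import com.amazonaws.services.s3.model.ObjectMetadata;

    // Illustrative helper only; not part of the actual patch.
    final class StorageClassExample {
      static final String S3_STORAGE_CLASS = "fs.s3.storage.class";
      static final String S3_STORAGE_CLASS_DEFAULT = "STANDARD";
      static final String S3_STORAGE_CLASS_HEADER = "x-amz-storage-class";

      // Resolve the storage class: the configured override if present, else STANDARD.
      static String resolveStorageClass(Configuration conf) {
        return conf.get(S3_STORAGE_CLASS, S3_STORAGE_CLASS_DEFAULT);
      }

      // Tag the upload metadata so the PUT request carries the x-amz-storage-class header.
      static void applyStorageClass(ObjectMetadata om, Configuration conf) {
        om.setHeader(S3_STORAGE_CLASS_HEADER, resolveStorageClass(conf));
      }
    }
If the AWS SDK request object is accessible at the point of upload,
PutObjectRequest.setStorageClass(...) would be an alternative to setting the
raw header; the header approach above follows the proposal as written.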
Advantage:
• Objects can be uploaded directly into the GLACIER storage class, which
considerably reduces billing by eliminating the unneeded staging of data in the
STANDARD tier.
was:
ENHANCE HADOOP DISTCP FOR CUSTOM S3 STORAGE CLASS
Problem statement:
The Hadoop distcp implementation has no property to override the storage class
when transferring data to Amazon S3; it does not set any storage class at all.
As a result, every object moved from the cluster to S3 using Hadoop distcp is
stored in the default storage class “STANDARD”.
Because of this limitation, clusters that rely heavily on distcp to transfer
data to S3 are forced to PUT objects into the high-cost “STANDARD” storage
class and then use S3 lifecycle policies to transition the data to a
cost-effective archive tier such as “GLACIER”.
This contributes to a considerable increase in billing, since the data is
staged in the “STANDARD” tier before transitioning to “GLACIER” even for use
cases where archival is the only business need.
This problem can be fixed by implementing the changes below in
hadoop-aws-x.x.x.jar.
Design:
The hadoop-aws jar is part of the Hadoop distribution and provides the s3, s3n
and s3a protocols for accessing objects stored in S3. To enable the storage
class override for all three protocols, the changes described below have to be
implemented for each protocol.
Note: depending on the Hadoop version of the cluster, the matching source
release of hadoop-aws-x.x.x.jar has to be obtained from the Apache download
site.
We will introduce a storage class property “fs.s3.storage.class”. It will
default to “STANDARD”, but can be overridden to any of the valid S3 storage
classes (STANDARD | REDUCED_REDUNDANCY | GLACIER | STANDARD_IA | ONEZONE_IA |
INTELLIGENT_TIERING | DEEP_ARCHIVE).
S3:
1. In the class “S3FileSystemConfigKeys” under the package
“org.apache.hadoop.fs.s3”, define the storage class property key and its
default value as shown below:
a. public static final String S3_STORAGE_CLASS = "fs.s3.storage.class";
b. public static final String S3_STORAGE_CLASS_DEFAULT = "STANDARD";
2. In the class “Jets3tFileSystemStore” under the package
“org.apache.hadoop.fs.s3”, introduce an “s3StorageClass” field and initialize
it in the constructor as below:
a. this.s3StorageClass = conf.get(S3FileSystemConfigKeys.S3_STORAGE_CLASS,
S3FileSystemConfigKeys.S3_STORAGE_CLASS_DEFAULT); - This reads the Hadoop
configuration and checks whether an override value has been provided for the
property “fs.s3.storage.class”. If so, that value is used while uploading the
object to S3; otherwise the default value “STANDARD” is used.
b. Then, in the put() method, set the storage class on the S3 object as follows
(a JetS3t sketch follows the S3N section below):
i. object.setStorageClass(this.s3StorageClass);
S3N:
1. In the class “S3NativeFileSystemConfigKeys” under the package
“org.apache.hadoop.fs.s3native”, define the storage class property key and its
default value as shown below:
a. public static final String S3_STORAGE_CLASS = "fs.s3.storage.class";
b. public static final String S3_STORAGE_CLASS_DEFAULT = "STANDARD";
2. In the class “Jets3tNativeFileSystemStore” under the package
“org.apache.hadoop.fs.s3native”, introduce an “s3StorageClass” field and
initialize it in the constructor as below:
a. this.s3StorageClass = conf.get(S3NativeFileSystemConfigKeys.S3_STORAGE_CLASS,
S3NativeFileSystemConfigKeys.S3_STORAGE_CLASS_DEFAULT); - This reads the Hadoop
configuration and checks whether an override value has been provided for the
property “fs.s3.storage.class”. If so, that value is used while uploading the
object to S3; otherwise the default value “STANDARD” is used.
b. Then, in the storeFile, storeLargeFile and storeEmptyFile methods, set the
storage class on the S3 object as follows (see the sketch after this section):
i. object.setStorageClass(this.s3StorageClass);
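A minimal, illustrative Java sketch of the JetS3t-side change shared by the s3
and s3n stores; the helper class and method names are placeholders, and the
real change would go into Jets3tFileSystemStore#put() and the
Jets3tNativeFileSystemStore store methods listed above:
    import org.apache.hadoop.conf.Configuration;
    import org.jets3t.service.model.S3Object;

    // Illustrative helper only; not part of the actual patch.
    final class Jets3tStorageClassExample {
      // Apply the configured storage class (default STANDARD) to the object about to be uploaded.
      static S3Object withStorageClass(S3Object object, Configuration conf) {
        String s3StorageClass = conf.get("fs.s3.storage.class", "STANDARD");
        object.setStorageClass(s3StorageClass);
        return object;
      }
    }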
S3A:
1. In the class “Constants” under the package “org.apache.hadoop.fs.s3a”,
define the storage class property key, its default value and the corresponding
S3 header name as shown below:
a. public static final String S3_STORAGE_CLASS = "fs.s3.storage.class";
b. public static final String S3_STORAGE_CLASS_DEFAULT = "STANDARD";
c. public static final String S3_STORAGE_CLASS_HEADER = "x-amz-storage-class";
2. In the class “S3AOutputStream” under the package “org.apache.hadoop.fs.s3a”,
introduce an “s3StorageClass” field and initialize it in the constructor as
below:
a. this.s3StorageClass = conf.get(S3_STORAGE_CLASS, S3_STORAGE_CLASS_DEFAULT);
- This reads the Hadoop configuration and checks whether an override value has
been provided for the property “fs.s3.storage.class”. If so, that value is used
while uploading the object to S3; otherwise the default value “STANDARD” is
used.
b. Then, in the close() method, set the storage class on the object metadata as
follows:
i. om.setHeader(S3_STORAGE_CLASS_HEADER, this.s3StorageClass);
3. In the class “S3AFastOutputStream” under the package
“org.apache.hadoop.fs.s3a”, introduce an “s3StorageClass” field and initialize
it in the constructor as below:
a. this.s3StorageClass = fs.getConf().get(S3_STORAGE_CLASS,
S3_STORAGE_CLASS_DEFAULT); - This reads the Hadoop configuration and checks
whether an override value has been provided for the property
“fs.s3.storage.class”. If so, that value is used while uploading the object to
S3; otherwise the default value “STANDARD” is used.
b. Then, in the close() method, set the storage class on the object metadata as
follows:
i. om.setHeader(S3_STORAGE_CLASS_HEADER, this.s3StorageClass);
Advantage:
• Objects can be uploaded directly into the GLACIER storage class, which
considerably reduces billing by eliminating the unneeded staging of data in the
STANDARD tier.
> Distcp to set S3 Storage Class
> ------------------------------
>
> Key: HADOOP-16259
> URL: https://issues.apache.org/jira/browse/HADOOP-16259
> Project: Hadoop Common
> Issue Type: New Feature
> Components: hadoop-aws
> Affects Versions: 2.8.4
> Reporter: Prakash Gopalsamy
> Priority: Minor
> Labels: aws-s3, distcp
> Attachments: ENHANCE_HADOOP_DISTCP_FOR_CUSTOM_S3_STORAGE_CLASS.docx
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> The Hadoop distcp implementation has no property to override the storage
> class when transferring data to Amazon S3; it does not set any storage class
> at all, so every object moved from the cluster to S3 using distcp is stored
> in the default storage class “STANDARD”. A new feature to override the
> default S3 storage class through configuration properties would make it
> possible to upload objects into other storage classes. I have put together a
> design for this feature in a design document and attached it to this JIRA.
> Kindly review and let me know your suggestions.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)