[https://issues.apache.org/jira/browse/HADOOP-16259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820323#comment-16820323]
Steve Loughran commented on HADOOP-16259:
-----------------------------------------
# No need to worry about s3n or s3; both are gone from trunk.
# And yes, your work will have to go into trunk.
There are a couple of related JIRAs:
* See HADOOP-12020 for the reduced-storage-layer discussion; it's fairly
straightforward (see the sketch after this list). If you want to add that
option *with tests*, you're welcome to do it.
* HADOOP-14837 covers identifying glaciated files, with the proposal that
{{BlockLocation}} instances created for S3 files add the storage class.
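As promised in the HADOOP-12020 bullet, here's a minimal sketch of what "just"
setting the storage class looks like with the AWS SDK for Java v1. The bucket,
key and file names are placeholders, and a real S3A change would go through its
own request plumbing rather than a bare client:
{code:java}
import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.model.StorageClass;

public class PutWithStorageClass {
  public static void main(String[] args) {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    // withStorageClass() sets the x-amz-storage-class header on the PUT;
    // ReducedRedundancy is the HADOOP-12020 case.
    PutObjectRequest request =
        new PutObjectRequest("my-bucket", "path/to/object", new File("data.bin"))
            .withStorageClass(StorageClass.ReducedRedundancy);
    s3.putObject(request);
  }
}
{code}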
Glacier is trouble, as the files show up in a LIST but a HEAD/GET fails. So how
do you deal with it in queries? Not directly my problem (Hive, Spark); I'll
give them visibility into the issue, so they can choose how to react.
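To make that failure mode concrete, here's a minimal sketch (AWS SDK for Java
v1; bucket and key are placeholders) of the {{InvalidObjectState}} error S3
raises when you try to read a glaciated object:
{code:java}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.AmazonS3Exception;

public class GlacierProbe {
  public static void main(String[] args) throws java.io.IOException {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    try {
      // Reading an object archived to Glacier fails with a 403 and error
      // code InvalidObjectState until the object has been restored.
      s3.getObject("my-bucket", "path/to/object").getObjectContent().close();
    } catch (AmazonS3Exception e) {
      if ("InvalidObjectState".equals(e.getErrorCode())) {
        System.err.println("Glaciated object; restore it before reading: "
            + e.getErrorMessage());
      } else {
        throw e;
      }
    }
  }
}
{code}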
How do you upload data to Glacier? If you want to do it directly, you can't use
the S3 client; you need {{AmazonGlacierClient}}. See
https://docs.aws.amazon.com/amazonglacier/latest/dev/getting-started-upload-archive.html
for details. This makes it a much more complex operation than "just" setting
the {{x-amz-storage-class}} header.
To repeat: you cannot upload data to S3 and say that it goes directly to
Glacier. Instead you upload as normal and set a lifecycle rule on the bucket to
archive after 24h. With that 24h delay, you can still get your data up.
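A minimal sketch of such a lifecycle rule via the SDK (bucket name and prefix
are placeholders; the same rule can equally be set in the console or CLI):
{code:java}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration;
import com.amazonaws.services.s3.model.StorageClass;
import com.amazonaws.services.s3.model.lifecycle.LifecycleFilter;
import com.amazonaws.services.s3.model.lifecycle.LifecyclePrefixPredicate;

public class ArchiveLifecycle {
  public static void main(String[] args) {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    // Transition everything under distcp/ to Glacier one day after it
    // lands, matching the 24h delay described above.
    BucketLifecycleConfiguration.Rule rule =
        new BucketLifecycleConfiguration.Rule()
            .withId("archive-after-24h")
            .withFilter(new LifecycleFilter(
                new LifecyclePrefixPredicate("distcp/")))
            .addTransition(new BucketLifecycleConfiguration.Transition()
                .withDays(1)
                .withStorageClass(StorageClass.Glacier))
            .withStatus(BucketLifecycleConfiguration.ENABLED);
    s3.setBucketLifecycleConfiguration("my-bucket",
        new BucketLifecycleConfiguration().withRules(rule));
  }
}
{code}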
Returning to your distcp proposal, then:
* Upload to Glacier: WONTFIX.
* Upload to reduced storage: HADOOP-12020. Please fix it for us!
All work has to be in trunk; for distcp to work properly you also need the
{{-direct}} option, which isn't going to be backported. I doubt this will be
either. Changing target version to 3.2.x.
> Distcp to set S3 Storage Class
> ------------------------------
>
> Key: HADOOP-16259
> URL: https://issues.apache.org/jira/browse/HADOOP-16259
> Project: Hadoop Common
> Issue Type: New Feature
> Components: hadoop-aws
> Affects Versions: 2.8.4
> Reporter: Prakash Gopalsamy
> Priority: Minor
> Labels: aws-s3, distcp
> Attachments: ENHANCE_HADOOP_DISTCP_FOR_CUSTOM_S3_STORAGE_CLASS.docx
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> The Hadoop distcp implementation doesn't have properties to override the
> storage class while transferring data to Amazon S3 storage; it doesn't set
> any storage class at all. Because of this, all objects moved from a cluster
> to S3 using Hadoop distcp are stored in the default storage class,
> "STANDARD". A new feature to override the default S3 storage class through
> configuration properties would make it possible to upload objects in other
> storage classes. I have come up with a design to implement this feature and
> uploaded a design document to this JIRA. Kindly review it and let me know
> your suggestions.