[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

vaquar khan (JIRA) Tue, 02 Apr 2019 22:52:27 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808388#comment-16808388
 ]


vaquar khan commented on SPARK-21797:
-------------------------------------

{quote}The issue is related to the AWS storage class, not Apache Spark. Even AWS Athena 
gives the same error when it tries to read from Glacier.

If you archive objects using the Glacier storage option, you must inspect the 
storage class of an object before you attempt to retrieve it. The customary GET 
request will work as expected if the object is stored in S3 Standard or Reduced 
Redundancy (RRS) storage. It will fail (with a 403 error) if the object is 
archived in Glacier. In this case, you must use the RESTORE operation 
(described below) to make your data available in S3.
{quote} * [https://aws.amazon.com/blogs/aws/archive-s3-to-glacier/]
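
The check described in the quoted AWS post can be sketched in plain Scala. This is only an illustration: the S3Object case class and the StorageClassCheck helper are hypothetical names, not part of Spark or the AWS SDK, and the listing data is assumed to have been fetched some other way. The sketch also ignores the case where an archived object has already been restored; such an object may still report GLACIER while being readable.

```scala
// Hypothetical model of one entry in an S3 listing; field names are illustrative.
final case class S3Object(key: String, storageClass: String)

object StorageClassCheck {
  // Storage classes that a plain GET can serve, per the quoted AWS post.
  private val directlyReadable = Set("STANDARD", "REDUCED_REDUNDANCY")

  // True if the object can be fetched without a RESTORE first.
  def isReadable(obj: S3Object): Boolean =
    directlyReadable.contains(obj.storageClass)

  // Split a listing into (readable now, needs RESTORE first).
  def partitionByReadability(objs: Seq[S3Object]): (Seq[S3Object], Seq[S3Object]) =
    objs.partition(isReadable)
}
```

Only the keys in the first half of the result would be safe to hand to a reader; the second half would need a RESTORE request before any GET.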
 

> spark cannot read partitioned data in S3 that are partly in glacier
> -------------------------------------------------------------------
>
>                 Key: SPARK-21797
>                 URL: https://issues.apache.org/jira/browse/SPARK-21797
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>         Environment: Amazon EMR
>            Reporter: Boris Clémençon 
>            Priority: Major
>              Labels: glacier, partitions, read, s3
>
> I have a Parquet dataset in S3 partitioned by date (dt), with the oldest dates 
> stored in AWS Glacier to save some money. For instance, we have:
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/    [in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/    [in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/    [not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/    [not in glacier]
> {noformat}
> I want to read this dataset, but only the subset of dates that are not yet in 
> Glacier, e.g.:
> {code:java}
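> import org.apache.spark.sql.functions.col
>
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> // Read the dataset root, then filter on the partition column dt
> val X = spark.read.parquet(path).where(col("dt").between(from, to))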
> {code}
> Unfortunately, I get this exception:
> {noformat}
> java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>  The operation is not valid for the object's storage class (Service: Amazon 
> S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: 
> C444D508B6042138)
> {noformat}
> It seems that Spark cannot handle a partitioned dataset when some partitions are 
> in Glacier. I could always read each date separately, add a column with 
> the corresponding date, and reduce(_ union _) at the end, but that is not pretty and it should 
> not be necessary.
> Is there any tip for reading the available data in the datastore even when old data 
> is in Glacier?
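
One possible workaround, sketched under assumptions: instead of pointing the reader at the dataset root, build the list of partition directories for the wanted dates and pass only those. PartitionPaths and dailyPaths are hypothetical helper names; the dt=YYYY-MM-DD layout is taken from the report above. Whether this fully avoids the InvalidObjectState error on a given EMR version is not confirmed here.

```scala
import java.time.LocalDate

object PartitionPaths {
  // Build one directory path per day in [from, to] (inclusive),
  // following the dt=YYYY-MM-DD layout shown in the report.
  def dailyPaths(base: String, from: LocalDate, to: LocalDate): Seq[String] = {
    val root = if (base.endsWith("/")) base else base + "/"
    Iterator.iterate(from)(_.plusDays(1))
      .takeWhile(d => !d.isAfter(to))
      .map(d => s"${root}dt=$d/") // LocalDate.toString is ISO yyyy-MM-dd
      .toList
  }
}

// With a SparkSession named spark in scope (assumed, not created here),
// the paths could be read while keeping dt as a column via basePath:
//   val base  = "s3://my-bucket/my-dataset/"
//   val paths = PartitionPaths.dailyPaths(
//     base, LocalDate.parse("2017-07-15"), LocalDate.parse("2017-07-24"))
//   val df = spark.read.option("basePath", base).parquet(paths: _*)
```

Every generated path has to exist in the bucket, otherwise Spark reports a missing path, so the date range should match the partitions that are actually present.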



