[
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146193#comment-16146193
]
Steve Loughran commented on SPARK-21797:
----------------------------------------
No? That's a shame. I only came across the option when I pasted the stack trace
into the IDE, and it said "enable this option". Sorry, I'm not sure what
other strategies there are. Sean? Any idea?
> spark cannot read partitioned data in S3 that are partly in glacier
> -------------------------------------------------------------------
>
> Key: SPARK-21797
> URL: https://issues.apache.org/jira/browse/SPARK-21797
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.0
> Environment: Amazon EMR
> Reporter: Boris Clémençon
> Labels: glacier, partitions, read, s3
>
> I have a parquet dataset in S3 partitioned by date (dt), with the oldest dates
> stored in AWS Glacier to save some money. For instance, we have...
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/ [in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/ [in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/ [not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/ [not in glacier]
> {noformat}
> I want to read this dataset, but only a subset of dates that are not yet in
> Glacier, e.g.:
> {code:java}
> import org.apache.spark.sql.functions.col
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I get the following exception:
> {noformat}
> java.io.IOException:
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
> The operation is not valid for the object's storage class (Service: Amazon
> S3; Status Code: 403; Error Code: InvalidObjectState; Request ID:
> C444D508B6042138)
> {noformat}
> It seems that Spark does not like partitioned datasets when some partitions are
> in Glacier. I could always read each date separately, add the dt column with
> the corresponding date, and reduce(_ union _) at the end, but that is not pretty
> and should not be necessary.
> Is there any tip for reading the available data in the datastore even with old
> data in Glacier?
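
One possible variation on the per-date workaround described in the issue is to build the list of partition directories up front and pass them all to a single read, so that Spark is only pointed at the directories that are not in Glacier. The sketch below is only an illustration under assumptions: `PartitionPaths` and `between` are hypothetical names that do not exist in Spark, and whether this avoids the InvalidObjectState error on EMR has not been verified here.

```scala
import java.time.LocalDate

// Hypothetical helper (not part of Spark): builds one path per daily
// partition between `from` and `to`, inclusive, using the dt=YYYY-MM-DD
// layout shown in the issue.
object PartitionPaths {
  def between(base: String, from: String, to: String): Seq[String] = {
    val start = LocalDate.parse(from) // expects ISO dates such as "2017-07-15"
    val end   = LocalDate.parse(to)
    val root  = base.stripSuffix("/")
    Iterator.iterate(start)(_.plusDays(1))
      .takeWhile(d => !d.isAfter(end)) // empty result if from is after to
      .map(d => s"$root/dt=$d/")       // LocalDate prints as yyyy-MM-dd
      .toList
  }
}

// Assumed usage in spark-shell, where `spark` is the active SparkSession:
//   val base  = "s3://my-bucket/my-dataset/"
//   val paths = PartitionPaths.between(base, "2017-07-15", "2017-07-24")
//   val X     = spark.read.option("basePath", base).parquet(paths: _*)
```

The `basePath` option asks Spark to keep `dt` as a partition column even though the individual partition directories are passed explicitly. Every generated path has to exist, since a missing directory would make the read fail, so the date range should stay within the partitions that are actually present.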