Boris Clémençon created SPARK-21797:
----------------------------------------
Summary: spark cannot read partitioned data in S3 that are partly
in glacier
Key: SPARK-21797
URL: https://issues.apache.org/jira/browse/SPARK-21797
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.2.0
Reporter: Boris Clémençon
Priority: Critical
I have a Parquet dataset in S3 partitioned by date (dt), with the oldest dates
stored in AWS Glacier to save money. For instance, we have:
{noformat}
s3://my-bucket/my-dataset/dt=2017-07-01/ [in glacier]
...
s3://my-bucket/my-dataset/dt=2017-07-09/ [in glacier]
s3://my-bucket/my-dataset/dt=2017-07-10/ [not in glacier]
...
s3://my-bucket/my-dataset/dt=2017-07-24/ [not in glacier]
{noformat}
I want to read this dataset, but only the subset of dates that are not yet in
Glacier, e.g.:
{code:java}
import org.apache.spark.sql.functions.col

val from = "2017-07-15"
val to = "2017-08-24"
val path = "s3://my-bucket/my-dataset/"
val X = spark.read.parquet(path).where(col("dt").between(from, to))
{code}
Unfortunately, I get the following exception:
{noformat}
java.io.IOException:
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
The operation is not valid for the object's storage class (Service: Amazon S3;
Status Code: 403; Error Code: InvalidObjectState; Request ID: C444D508B6042138)
{noformat}
It seems that Spark cannot handle a partitioned dataset when some partitions
are in Glacier. I could always read each date individually, add a dt column
holding that date, and reduce(_ union _) at the end, but that is ugly as hell
and should not be necessary.
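For reference, a minimal sketch of that workaround, assuming a spark-shell
session where `spark` is already defined. The helper names `datesBetween` and
`readDates` are hypothetical, and the end date is set to the last partition
shown in the listing above. It assumes every requested dt=... directory exists
and that all partitions share the same schema, since union matches columns by
position.

```scala
import java.time.LocalDate
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

// Hypothetical helper: every date from `from` to `to` (inclusive) as
// ISO yyyy-MM-dd strings. Returns an empty list if `from` is after `to`.
def datesBetween(from: String, to: String): Seq[String] = {
  val start = LocalDate.parse(from)
  val end = LocalDate.parse(to)
  Iterator.iterate(start)(_.plusDays(1))
    .takeWhile(d => !d.isAfter(end))
    .map(_.toString)
    .toList
}

// Hypothetical helper: read each dt=... directory on its own, re-attach the
// partition value as a string column, and union the per-date DataFrames.
// `reduce` throws on an empty list, so `dates` must be non-empty.
def readDates(spark: SparkSession, basePath: String, dates: Seq[String]): DataFrame =
  dates
    .map(d => spark.read.parquet(s"${basePath}dt=$d/").withColumn("dt", lit(d)))
    .reduce(_ union _)

// Only directories in the requested range are passed to the reader.
val X = readDates(spark, "s3://my-bucket/my-dataset/",
  datesBetween("2017-07-15", "2017-07-24"))
```

Since only the listed directories are handed to the reader, the Glacier
partitions should never be touched, although I have not confirmed that this
avoids the exception in every setup.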
Is there a way to read the available data in the datastore even when older
data is in Glacier?