[
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140138#comment-16140138
]
Steve Loughran commented on SPARK-21797:
----------------------------------------
I was talking about the cost and time of getting data back from Glacier. If
that's the only place where the data lives, then it's slow and expensive, and
that's the bit I'm describing as niche. Given that I've been working full time
on S3A, I'm reasonably confident it gets used a lot.
If you read data in S3 that has been archived to Glacier, you *will get a
403*. According to Jeff Barr himself:
https://aws.amazon.com/blogs/aws/archive-s3-to-glacier/
bq. If you archive objects using the Glacier storage option, you must inspect
the storage class of an object before you attempt to retrieve it. The customary
GET request will work as expected if the object is stored in S3 Standard or
Reduced Redundancy (RRS) storage. It will fail (with a 403 error) if the object
is archived in Glacier. In this case, you must use the RESTORE operation
(described below) to make your data available in S3.
bq. You use S3’s new RESTORE operation to access an object archived in Glacier.
As part of the request, you need to specify a retention period in days.
Restoring an object will generally take 3 to 5 hours. Your restored object will
remain in both Glacier and S3’s Reduced Redundancy Storage (RRS) for the
duration of the retention period. At the end of the retention period the
object’s data will be removed from S3; the object will remain in Glacier.
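As a rough sketch of what that means in practice, using the AWS SDK for Java from Scala (bucket name and object key here are hypothetical, and the 7-day retention window is just an example):
{code:java}
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.RestoreObjectRequest

val s3 = AmazonS3ClientBuilder.defaultClient()

// Hypothetical bucket/key; inspect the storage class before a GET.
val bucket = "my-bucket"
val key = "my-dataset/dt=2017-07-01/part-00000.parquet"
val meta = s3.getObjectMetadata(bucket, key)

if ("GLACIER".equalsIgnoreCase(meta.getStorageClass)) {
  // A plain GET would fail with 403 InvalidObjectState; kick off a
  // restore instead. The object becomes readable only once the restore
  // completes (typically hours later), for the retention period given.
  s3.restoreObjectV2(new RestoreObjectRequest(bucket, key).withExpirationInDays(7))
}
{code}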
Like I said, I'd be interested in getting the full stack trace if you try to
read this with an S3A client. Not for fixing, but for better error reporting:
probably point people at Jeff's blog entry. Or this JIRA :)
> spark cannot read partitioned data in S3 that are partly in glacier
> -------------------------------------------------------------------
>
> Key: SPARK-21797
> URL: https://issues.apache.org/jira/browse/SPARK-21797
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: Boris Clémençon
> Labels: glacier, partitions, read, s3
>
> I have a parquet dataset in S3 partitioned by date ({{dt}}), with the oldest
> dates stored in AWS Glacier to save some money. For instance, we have...
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/ [in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/ [in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/ [not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/ [not in glacier]
> {noformat}
> I want to read this dataset, but only the subset of dates that are not yet in
> Glacier, e.g.:
> {code:java}
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I have the exception
> {noformat}
> java.io.IOException:
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
> The operation is not valid for the object's storage class (Service: Amazon
> S3; Status Code: 403; Error Code: InvalidObjectState; Request ID:
> C444D508B6042138)
> {noformat}
> It seems that Spark does not like partitioned datasets when some partitions are
> in Glacier. I could always read each date individually, add the column with the
> current date, and {{reduce(_ union _)}} at the end, but that is not pretty and it
> should not be necessary.
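> A sketch of that per-date workaround, using the paths and date range above (it assumes every {{dt}} in the range exists and is out of Glacier):
> {code:java}
> import java.time.LocalDate
> import org.apache.spark.sql.functions.lit
>
> val from = LocalDate.parse("2017-07-15")
> val to = LocalDate.parse("2017-08-24")
>
> // Enumerate each date in [from, to], read its partition directory
> // directly, re-attach the dt column, and union the results.
> val dates = Iterator.iterate(from)(_.plusDays(1)).takeWhile(!_.isAfter(to)).toSeq
> val X = dates
>   .map(d => spark.read.parquet(s"s3://my-bucket/my-dataset/dt=$d")
>     .withColumn("dt", lit(d.toString)))
>   .reduce(_ union _)
> {code}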
> Is there a way to read the available data in the datastore even when old data
> is in Glacier?
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]