[
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141358#comment-16141358
]
Boris Clémençon commented on SPARK-21797:
------------------------------------------
That's very good news indeed, and the easiest way to fix it! A more meaningful
error message would be appreciated nonetheless. I will talk with the AWS SDK
team about tackling the issue in a more orthodox way.
Thanks!
*Additional points:*
You mentioned that "read() takes so long other bits of the system will start to
think your worker is hanging". According to our tests, reading parquet data
from S3 is at most 20% slower than from HDFS. Do you agree? As for the price,
you are right: it can be expensive to read the same data from S3 again and
again. In our case, besides ML, Spark is mostly used for ETL processes and we
use Redshift for analytics, so there is just one read from S3 per process every
day. It is therefore more advantageous to read once from S3 rather than copying
to HDFS first and reading from there (now that I have the option to read a
partitioned dataset from S3).
Thanks again!
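For the record, a minimal sketch of the kind of workaround discussed here: enumerate only the partition directories known to be outside Glacier and pass them explicitly to the reader, keeping the {{basePath}} option so Spark's standard partition discovery still infers the {{dt}} column. The bucket and dates are taken from the issue's example; the day range and variable names are illustrative, not the exact code used.

```scala
// Illustrative workaround: list only the non-Glacier partition directories
// (dt=2017-07-10 through dt=2017-07-24 in the issue's example) instead of
// pointing Spark at the whole dataset root.
val basePath = "s3://my-bucket/my-dataset/"
val availableDays = (10 to 24).map(d => f"${basePath}dt=2017-07-$d%02d")

// With a live SparkSession, reading only the listed partitions would look like:
// val df = spark.read.option("basePath", basePath).parquet(availableDays: _*)
// basePath tells the parquet reader where partition discovery starts, so the
// dt column is still materialized even though only subdirectories are read.

println(availableDays.head) // s3://my-bucket/my-dataset/dt=2017-07-10
println(availableDays.last) // s3://my-bucket/my-dataset/dt=2017-07-24
```

This avoids the per-date read-and-union dance described in the issue, at the cost of having to know up front which partitions have been transitioned to Glacier.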
> spark cannot read partitioned data in S3 that are partly in glacier
> -------------------------------------------------------------------
>
> Key: SPARK-21797
> URL: https://issues.apache.org/jira/browse/SPARK-21797
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.0
> Environment: Amazon EMR
> Reporter: Boris Clémençon
> Labels: glacier, partitions, read, s3
>
> I have a dataset in parquet in S3, partitioned by date (dt), with the oldest
> dates stored in AWS Glacier to save some money. For instance, we have...
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/ [in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/ [in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/ [not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/ [not in glacier]
> {noformat}
> I want to read this dataset, but only the subset of dates that are not yet in
> Glacier, e.g.:
> {code:java}
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I have the exception
> {noformat}
> java.io.IOException:
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
> The operation is not valid for the object's storage class (Service: Amazon
> S3; Status Code: 403; Error Code: InvalidObjectState; Request ID:
> C444D508B6042138)
> {noformat}
> It seems that Spark does not like a partitioned dataset when some partitions
> are in Glacier. I could always read each date specifically, add a column with
> the current date, and reduce(_ union _) at the end, but that is not pretty
> and should not be necessary.
> Is there any tip to read available data in the datastore even with old data
> in glacier?
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)