Re: Spark querying parquet data partitioned in S3

2017-07-05 Thread Steve Loughran
> On 29 Jun 2017, at 17:44, fran wrote: > > We have got data stored in S3 partitioned by several columns. Let's say it follows this hierarchy: > s3://bucket/data/column1=X/column2=Y/parquet-files > > We run a Spark job in an EMR cluster (1 master, 3 slaves) and

Spark querying parquet data partitioned in S3

2017-06-30 Thread Francisco Blaya
We have got data stored in S3 partitioned by several columns. Let's say it follows this hierarchy: s3://bucket/data/column1=X/column2=Y/parquet-files We run a Spark job in an EMR cluster (1 master, 3 slaves) and realised the following: A) - When we declare the initial dataframe to be the whole
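
A minimal sketch of reading the partitioned layout described above, assuming the placeholder path s3://bucket/data/column1=X/column2=Y/parquet-files from the message; the bucket, column names, and filter values are illustrative, not taken from the poster's actual job:

import org.apache.spark.sql.SparkSession

object PartitionedParquetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partitioned-parquet-s3")
      .getOrCreate()

    // Point the reader at the dataset root so Spark discovers the
    // column1=.../column2=... directories as partition columns.
    val df = spark.read.parquet("s3://bucket/data")

    // Filtering on the partition columns lets Spark prune partitions and
    // list/read only the matching S3 prefixes rather than the whole tree.
    val pruned = df.filter(df("column1") === "X" && df("column2") === "Y")

    pruned.show()
    spark.stop()
  }
}

Whether declaring the dataframe over the whole dataset root or over a single partition path performs better depends on how much S3 listing the partition discovery triggers, which appears to be the behaviour the thread is comparing.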

Spark querying parquet data partitioned in S3

2017-06-29 Thread fran
The above is for EMR 5.5.0, Hadoop 2.7.3 and Spark 2.1.0.