Re: Spark querying parquet data partitioned in S3

2017-07-05 Thread Steve Loughran

> On 29 Jun 2017, at 17:44, fran  wrote:
> 
> We have got data stored in S3 partitioned by several columns. Let's say
> following this hierarchy:
> s3://bucket/data/column1=X/column2=Y/parquet-files
> 
> We run a Spark job in an EMR cluster (1 master, 3 slaves) and realised the
> following:
> 
> A) - When we declare the initial dataframe to be the whole dataset (val df =
> sqlContext.read.parquet("s3://bucket/data/")) then the driver splits the job
> into several tasks (259) that are performed by the executors and we believe
> the driver gets back the parquet metadata.
> 
> Question: The above takes about 25 minutes for our dataset, we believe it
> should be a lazy query (as we are not performing any actions) however it
> looks like something is happening, all the executors are reading from S3. We
> have tried mergeData=false and setting the schema explicitly via
> .schema(someSchema). Is there any way to speed this up?
> 
> B) - When we declare the initial dataframe to be scoped by the first column
> (val df = sqlContext.read.parquet("s3://bucket/data/column1=X")) then it seems
> that all the work (getting the parquet metadata) is done by the driver and
> there is no job submitted to Spark. 
> 
> Question: Why does (A) send the work to executors but (B) does not?
> 
> The above is for EMR 5.5.0, Hadoop 2.7.3 and Spark 2.1.0.
> 
> 

Split calculation can be very slow against object stores, especially when the 
directory structure is deep: the treewalking done here is pretty inefficient, 
as every directory probe and listing is a separate HTTP request against the 
object store.

Then there's the schema merge, which reads the footer of every file and so has 
to do a seek() to the tail of each one. That work is parallelised around the 
cluster, before your job proper actually gets scheduled. 

Turning that off with spark.sql.parquet.mergeSchema = false should make it go 
away, though in your case it evidently has not.
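As a sketch of what the combination looks like in Spark 2.x: disabling the 
merge per-read and supplying an explicit schema should mean no footers need to 
be read at planning time at all. The column names and types here are 
placeholders, not your real schema.

```scala
import org.apache.spark.sql.types._

// Hypothetical schema for illustration; substitute your real columns.
// Note the partition columns (column1, column2) can be included too.
val explicitSchema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("value", StringType, nullable = true)
))

val df = sqlContext.read
  .option("mergeSchema", "false")   // per-read override of spark.sql.parquet.mergeSchema
  .schema(explicitSchema)           // skip schema inference entirely
  .parquet("s3://bucket/data/")
```

If that still triggers a long cluster-wide job before any action, the time is 
probably going into file/partition listing rather than schema inference.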

A call to jstack against the driver will show where it is at: you'll probably 
have to start from there.
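Something along these lines, assuming the JDK tools are on the path on the 
master node (the exact driver process name depends on how the job was 
submitted):

```shell
# List JVM processes and find the driver's pid
jps -lm

# Dump the driver's thread stacks; repeat a few times to see where it is stuck
jstack <driver-pid> > driver-stacks.txt
```

A handful of dumps a few seconds apart is usually enough to show whether the 
driver is looping in S3 listing code or blocked waiting on executors.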


I know if you are using EMR you are stuck using Amazon's own S3 clients; if you 
were on Apache's own artifacts you could move up to Hadoop 2.8 and set the 
spark.hadoop.fs.s3a.experimental.input.fadvise=random option for high-speed 
random access. You can also turn off job summary creation in Spark.
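Hypothetically, on Apache Hadoop 2.8 artifacts with the s3a connector (not 
EMR's EMRFS), the relevant settings might be passed like this; exact key names 
vary by Hadoop and Spark version, so treat these as a starting point to verify 
against your versions' documentation:

```shell
spark-submit \
  --conf spark.sql.parquet.mergeSchema=false \
  --conf spark.hadoop.fs.s3a.experimental.input.fadvise=random \
  --conf spark.hadoop.parquet.enable.summary-metadata=false \
  ... your application jar and arguments ...
```

The last setting suppresses the Parquet _metadata summary files on write; 
recent Spark releases already disable those by default, so it may be a no-op 
depending on your version.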


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


