Yes, you can try setting spark.sql.sources.partitionDiscovery.enabled to false.
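
For example, something like this (a rough sketch against the Spark 1.x SQLContext API; set it before the table is first read):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val conf = new SparkConf().setAppName("no-partition-discovery")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Turn off automatic partition discovery for file-based sources.
    sqlContext.setConf("spark.sql.sources.partitionDiscovery.enabled", "false")

    // Equivalent command-line form:
    //   spark-submit --conf spark.sql.sources.partitionDiscovery.enabled=false ...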

BTW, which version are you using?

Hao

From: Jerrick Hoang [mailto:jerrickho...@gmail.com]
Sent: Thursday, August 20, 2015 12:16 PM
To: Philip Weaver
Cc: user
Subject: Re: Spark Sql behaves strangely with tables with a lot of partitions

I guess the question is why Spark has to do partition discovery across all 
partitions when the query only needs to look at one. Is there a conf flag to 
turn this off?

On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver 
<philip.wea...@gmail.com> wrote:
I've had the same problem. It turns out that Spark (specifically Parquet) is 
very slow at partition discovery. It got better in 1.5 (not yet released), but 
was still unacceptably slow. Sadly, we ended up reading Parquet files manually 
in Python (via C++) and had to abandon Spark SQL because of this problem.
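
If you want to stay in Spark, a related workaround is to point the reader 
directly at the one partition directory, so only that path gets listed instead 
of the whole table. A rough sketch, assuming a SQLContext as above (the path 
and table layout here are made up):

    // Read the single partition directory directly, so Spark never lists
    // the other partition directories. Path is hypothetical.
    val single = sqlContext.read.parquet("hdfs:///warehouse/events/date=20140701")

    // When reading a leaf directory directly, the `date` partition column
    // is not inferred from the path, so no filter is needed:
    println(single.count())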

On Wed, Aug 19, 2015 at 7:51 PM, Jerrick Hoang 
<jerrickho...@gmail.com> wrote:
Hi all,

I did a simple experiment with Spark SQL. I created a partitioned Parquet table 
with only one partition (date=20140701). A simple `select count(*) from table 
where date=20140701` ran very fast (0.1 seconds). However, as I added more 
partitions the query took longer and longer. When I added about 10,000 
partitions, the query took far too long. I feel like querying a single 
partition should not be affected by how many other partitions exist. Is this a 
known behaviour? What does Spark try to do here?
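
Roughly what the setup looked like (a sketch; the table name, paths, and toy 
data are placeholders, and it assumes an existing sc/sqlContext):

    import sqlContext.implicits._

    // A toy DataFrame standing in for the real data.
    val df = sc.parallelize(Seq((20140701, "a"), (20140701, "b")))
      .toDF("date", "value")

    // Write a table partitioned by date, then register and query it.
    df.write.partitionBy("date").parquet("hdfs:///warehouse/events")

    sqlContext.read.parquet("hdfs:///warehouse/events")
      .registerTempTable("events")

    // Fast with one partition on disk; slows down as more date=... directories
    // are added, even though only date=20140701 is needed:
    sqlContext.sql("SELECT COUNT(*) FROM events WHERE date = 20140701").show()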

Thanks,
Jerrick

