Hi, I created https://issues.apache.org/jira/browse/SPARK-21056 and proposed an implementation here: https://github.com/apache/spark/pull/18269
I'll try to address cloud-fan's comment ASAP. Any input is welcome.

Regards,
Bertrand

On Thu, Jun 15, 2017 at 1:27 AM, Mike Wheeler <rotationsymmetr...@gmail.com> wrote:

> I might have a similar problem. In the spark-shell:
>
>     val data = spark.read.parquet("...")
>
> After hitting enter, it takes more than 30 seconds for the "read" to
> complete and return the command line. I am running Spark 2.1.1, but I
> have also tested 2.0.2 and encountered the same issue.
>
> Thanks,
> Mike
>
> On Tue, Jun 13, 2017 at 10:05 AM, Michael Allman <mich...@videoamp.com> wrote:
>
>> Hi Bertrand,
>>
>> I encourage you to create a ticket for this and submit a PR if you have
>> time. Please add me as a listener, and I'll try to contribute/review.
>>
>> Michael
>>
>> On Jun 6, 2017, at 5:18 AM, Bertrand Bossy <bertrand.bo...@teralytics.ch> wrote:
>>
>> Hi,
>>
>> Since moving to Spark 2.1 from 2.0, we have been experiencing a
>> performance regression when reading a large, partitioned Parquet
>> dataset: we observe many (hundreds of) very short jobs executing before
>> the job that actually reads the data starts. I looked into this and
>> pinned it down to PartitioningAwareFileIndex: while recursively listing
>> the directories, if a directory contains more than
>> "spark.sql.sources.parallelPartitionDiscovery.threshold" (default: 32)
>> paths, its children are listed using a Spark job. Because the tree is
>> listed serially, this can result in a lot of small Spark jobs executed
>> one after the other, and the overhead dominates. Performance can be
>> improved by raising
>> "spark.sql.sources.parallelPartitionDiscovery.threshold" (a quick
>> illustration follows at the end of this thread), but tuning it is not a
>> satisfactory solution.
>>
>> I think the current behaviour could be improved by walking the
>> directory tree in breadth-first order and launching only a single Spark
>> job to list files in parallel whenever the number of paths to be listed
>> at some level exceeds
>> spark.sql.sources.parallelPartitionDiscovery.threshold (a sketch of
>> this idea also follows at the end of this thread).
>>
>> Does this approach make sense? The most closely related ticket I have
>> found is "Regression in file listing performance"
>> (https://issues.apache.org/jira/browse/SPARK-18679).
>>
>> Unless there is a reason for the current behaviour, I will create a
>> ticket for this soon. I might have some time in the coming days to work
>> on it.
>>
>> Regards,
>> Bertrand

--
Bertrand Bossy | TERALYTICS
software engineer

Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
phone: +41 78 821 95 00
email: bertrand.bo...@teralytics.net
www.teralytics.net
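
The quick illustration mentioned above: raising the threshold from the
spark-shell, where the built-in `spark` session is available. The value
1024 is purely illustrative, not a recommendation:

    // Raise the parallel-listing threshold so that levels with up to 1024
    // paths are still listed serially on the driver instead of triggering
    // a distributed listing job. Tune the value to your directory layout.
    spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "1024")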
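
And a minimal sketch of the proposed breadth-first listing, to make the
idea concrete. This is not the actual PartitioningAwareFileIndex code:
bfsListLeafFiles and listInParallel are hypothetical names, with
listInParallel standing in for Spark's distributed listing helper (one
Spark job per call):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileStatus, Path}

    // Walk the tree level by level. Each level is listed in one shot:
    // serially on the driver when the level is small, with a single
    // distributed job when it exceeds the threshold
    // (spark.sql.sources.parallelPartitionDiscovery.threshold).
    def bfsListLeafFiles(
        roots: Seq[Path],
        hadoopConf: Configuration,
        threshold: Int,
        listInParallel: Seq[Path] => Seq[FileStatus]): Seq[FileStatus] = {
      val leafFiles = Seq.newBuilder[FileStatus]
      var frontier: Seq[Path] = roots
      while (frontier.nonEmpty) {
        val statuses: Seq[FileStatus] =
          if (frontier.size > threshold) {
            listInParallel(frontier)  // one Spark job for the whole level
          } else {
            frontier.flatMap(p => p.getFileSystem(hadoopConf).listStatus(p))
          }
        val (dirs, files) = statuses.partition(_.isDirectory)
        leafFiles ++= files            // leaf files found at this level
        frontier = dirs.map(_.getPath) // directories form the next BFS level
      }
      leafFiles.result()
    }

With this shape, a dataset with hundreds of small directories at one
level triggers at most one Spark job per level instead of one job per
directory, which is exactly the overhead described above.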