Hi,

Since moving from Spark 2.0 to 2.1, we have been seeing a performance
regression when reading a large, partitioned Parquet dataset:

We observe many (hundreds of) very short jobs executing before the job that
actually reads the data starts. I looked into this and pinned it down to
PartitioningAwareFileIndex: while recursively listing the directories, if a
directory contains more
than "spark.sql.sources.parallelPartitionDiscovery.threshold" (default: 32)
paths, its children are listed using a Spark job. Because the tree is
walked serially, this can result in a lot of small Spark jobs executed one
after the other, and their overhead dominates. Performance can be improved
by tuning "spark.sql.sources.parallelPartitionDiscovery.threshold", but
this is not a satisfactory solution.
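
For reference, the workaround we use for now looks roughly like this
(assuming a SparkSession named "spark"; the threshold value and the path are
only examples, and this merely shifts more of the listing onto the driver):

    // Raise the threshold so that more directories are listed directly on
    // the driver instead of each triggering a small Spark job.
    spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "1024")

    // Illustrative path, not the real dataset location.
    val df = spark.read.parquet("/path/to/partitioned/dataset")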

I think the current behaviour could be improved by walking the directory
tree in breadth-first order, listing each level as a whole and launching a
single Spark job for a level only when the number of paths to be listed at
that level exceeds spark.sql.sources.parallelPartitionDiscovery.threshold.
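
Concretely, I have something along these lines in mind (only a rough sketch
to illustrate the idea; listLeafFilesBfs and listInParallel are made-up
names, not the actual PartitioningAwareFileIndex API, and listInParallel
stands in for the existing parallel listing via a Spark job):

    import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

    def listLeafFilesBfs(
        fs: FileSystem,
        roots: Seq[Path],
        threshold: Int,
        listInParallel: Seq[Path] => Seq[FileStatus]): Seq[FileStatus] = {
      var pending: Seq[Path] = roots
      val leafFiles = Seq.newBuilder[FileStatus]
      while (pending.nonEmpty) {
        // List one whole level of the tree at a time: a single Spark job if
        // the level is large enough, plain driver-side listing otherwise.
        val statuses: Seq[FileStatus] =
          if (pending.size > threshold) listInParallel(pending)
          else pending.flatMap(path => fs.listStatus(path))
        val (dirs, files) = statuses.partition(_.isDirectory)
        leafFiles ++= files
        // Directories found at this level form the next level to list.
        pending = dirs.map(_.getPath)
      }
      leafFiles.result()
    }

This way the number of Spark jobs is bounded by the depth of the directory
tree rather than by the number of directories.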

Does this approach make sense? The most closely related ticket I have found
is "Regression in file listing performance"
( https://issues.apache.org/jira/browse/SPARK-18679 ).

Unless there is a reason for the current behaviour, I will create a ticket
for this soon. I might have some time in the coming days to work on it.

Regards,
Bertrand

-- 

Bertrand Bossy | TERALYTICS

*software engineer*

Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
www.teralytics.net
