Hi,

Since moving to Spark 2.1 from 2.0, we have been experiencing a performance regression when reading a large, partitioned Parquet dataset:
We observe many (hundreds of) very short jobs executing before the job that actually reads the data starts. I looked into this and pinned it down to PartitioningAwareFileIndex: while recursively listing the directories, if a directory contains more than "spark.sql.sources.parallelPartitionDiscovery.threshold" (default: 32) paths, its children are listed using a Spark job. Because the tree is walked serially, this can result in a lot of small Spark jobs executed one after the other, and the scheduling overhead dominates.

Performance can be improved by tuning "spark.sql.sources.parallelPartitionDiscovery.threshold", but that is not a satisfactory solution.

I think the current behaviour could be improved by walking the directory tree in breadth-first order and launching a single Spark job to list a level in parallel only when the number of paths at that level exceeds spark.sql.sources.parallelPartitionDiscovery.threshold (a rough sketch of what I have in mind is below, after my signature). Does this approach make sense?

The most closely related ticket I have found is "Regression in file listing performance" ( https://issues.apache.org/jira/browse/SPARK-18679 ). Unless there is a reason for the current behaviour, I will create a ticket for this soon. I might have some time in the coming days to work on it.

Regards,
Bertrand

--
Bertrand Bossy | TERALYTICS
*software engineer*

Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
www.teralytics.net
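
P.S. To make the idea a bit more concrete, here is a rough, untested sketch of the breadth-first listing. The object and method names are made up for illustration, and it glosses over details a real fix would have to handle, for example shipping the driver's Hadoop configuration to the executors in a serializable wrapper, as the existing parallel listing code does.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

object BfsFileListingSketch {

  // List a single directory on whichever machine this runs on. Only plain
  // strings and booleans are returned so that nothing non-serializable is
  // shipped back from executors.
  private def listDir(dir: String, conf: Configuration): Seq[(String, Boolean)] = {
    val path = new Path(dir)
    val fs = path.getFileSystem(conf)
    fs.listStatus(path).map(s => (s.getPath.toString, s.isDirectory)).toSeq
  }

  // Walk the tree level by level. At most one Spark job is launched per
  // level, and only when that level holds more paths than `threshold`;
  // otherwise the level is listed serially on the driver.
  def listLeafFiles(spark: SparkSession, roots: Seq[String], threshold: Int): Seq[String] = {
    val sc = spark.sparkContext
    var currentLevel: Seq[String] = roots
    val leafFiles = Seq.newBuilder[String]

    while (currentLevel.nonEmpty) {
      val children: Seq[(String, Boolean)] =
        if (currentLevel.size > threshold) {
          // One job for the whole level instead of one job per wide directory.
          // Executors recreate a default Hadoop Configuration here; the real
          // fix would ship the driver's configuration instead.
          sc.parallelize(currentLevel, math.min(currentLevel.size, 10000))
            .flatMap(dir => listDir(dir, new Configuration()))
            .collect().toSeq
        } else {
          currentLevel.flatMap(dir => listDir(dir, sc.hadoopConfiguration))
        }

      // Directories feed the next level; everything else is a leaf file.
      val (dirs, files) = children.partition(_._2)
      leafFiles ++= files.map(_._1)
      currentLevel = dirs.map(_._1)
    }
    leafFiles.result()
  }
}

Until something along these lines is in place, our only workaround is to raise the threshold, e.g. spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "1024"), but as mentioned above that does not feel like a proper fix.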