Bertrand Bossy created SPARK-21056:
--------------------------------------
Summary: InMemoryFileIndex.listLeafFiles should create at most one
spark job when listing files in parallel
Key: SPARK-21056
URL: https://issues.apache.org/jira/browse/SPARK-21056
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.1.1, 2.2.0
Reporter: Bertrand Bossy
Given a partitioned file relation (e.g. Parquet) with the layout:
{code}
root/a=../b=../c=..
{code}
InMemoryFileIndex.listLeafFiles launches one Spark job per (a, b) combination,
i.e. numberOfPartitions(a) times numberOfPartitions(b) jobs in total, and runs
them sequentially to list the leaf files. This happens when both
numberOfPartitions(a) and numberOfPartitions(b) are below
{{spark.sql.sources.parallelPartitionDiscovery.threshold}} and
numberOfPartitions(c) is above that threshold.
Since the jobs run sequentially, the per-job overhead dominates and the file
listing can become significantly slower than listing the files from the
driver.
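The job count described above can be illustrated with a toy model. This is a sketch only, not the actual Spark code: {{make_tree}} and {{count_jobs}} are hypothetical helpers, the directory tree is a nested dict, and the model assumes that a batch of sibling directories is listed by exactly one job when its size exceeds the threshold, and is otherwise listed one directory at a time on the driver. The threshold value 32 is just an example value.

```python
def make_tree(fanouts):
    """Build a nested dict with fanouts[i] child directories at depth i.

    Leaf directories are represented by empty dicts.
    """
    if not fanouts:
        return {}
    return {f"d{i}": make_tree(fanouts[1:]) for i in range(fanouts[0])}


def count_jobs(dirs, threshold):
    """Count the jobs launched when listing a batch of sibling directories.

    Modelling assumption: a batch larger than `threshold` is handled by a
    single job that lists everything below it; a smaller batch is listed
    serially on the driver, and each directory's children form a new batch.
    """
    if len(dirs) > threshold:
        return 1
    return sum(count_jobs(list(d.values()), threshold) for d in dirs)


# root/a=../b=../c=.. with 3 values of a, 4 of b and 50 of c
tree = make_tree([3, 4, 50])
print(count_jobs([tree], threshold=32))  # 12, i.e. 3 * 4 separate jobs
```

In this model the count only drops to one when the first level already exceeds the threshold, e.g. {{make_tree([40, 4, 50])}}, and to zero when no level exceeds it.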
I propose that InMemoryFileIndex.listLeafFiles should launch at most one Spark
job for listing leaf files.