Bertrand Bossy created SPARK-21056:
--------------------------------------

             Summary: InMemoryFileIndex.listLeafFiles should create at most one 
Spark job when listing files in parallel
                 Key: SPARK-21056
                 URL: https://issues.apache.org/jira/browse/SPARK-21056
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.1.1, 2.2.0
            Reporter: Bertrand Bossy


Given a partitioned file relation (e.g. Parquet) laid out as:
{code}
root/a=../b=../c=..
{code}
InMemoryFileIndex.listLeafFiles sequentially runs 
numberOfPartitions(a) * numberOfPartitions(b) Spark jobs to list leaf files 
when both numberOfPartitions(a) and numberOfPartitions(b) are below 
{{spark.sql.sources.parallelPartitionDiscovery.threshold}} and 
numberOfPartitions(c) is above that threshold.
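
To make the job count concrete, here is a minimal Python model of the 
behaviour described above. It is not Spark code: the function name, the 
fan-out list and the threshold value of 32 are assumptions chosen for 
illustration, and how the real implementation treats a count exactly equal 
to the threshold is not modelled.

```python
def count_listing_jobs(fanouts, threshold):
    """Count the parallel jobs launched by a simplified recursive listing.

    fanouts[i] is the number of child directories under each directory at
    depth i (hypothetical input, e.g. [a, b, c] partition counts).
    """
    if not fanouts:
        # Nothing left to list below this level, so no job is needed.
        return 0
    siblings = fanouts[0]
    if siblings > threshold:
        # Above the threshold: one job lists this whole group of siblings.
        return 1
    # At or below the threshold: each sibling is listed from the driver,
    # and each one triggers its own recursive listing of its children.
    return siblings * count_listing_jobs(fanouts[1:], threshold)


# 10 values of a, 20 values of b per a, 100 values of c per b.
print(count_listing_jobs([10, 20, 100], threshold=32))  # prints 200
```

In this model the 10 * 20 = 200 jobs each list only 100 directories, so 
the fixed cost of starting a job is paid 200 times.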

Because the jobs run sequentially, the per-job overhead dominates, and the 
file listing operation can become significantly slower than listing the files 
from the driver.

I propose that InMemoryFileIndex.listLeafFiles launch at most one Spark 
job for listing leaf files.
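
One way to meet this bound, sketched as a small Python model rather than as 
Spark code, is to walk the directory tree level by level on the driver and 
hand the entire frontier to a single job as soon as it grows past the 
threshold. The function name, the fan-out list and the threshold value of 32 
are assumptions for illustration; the issue does not prescribe this 
particular strategy.

```python
def count_jobs_batched(fanouts, threshold):
    """Count jobs when the whole frontier is handed to a single job.

    fanouts[i] is the number of child directories under each directory at
    depth i (hypothetical input, e.g. [a, b, c] partition counts).
    """
    frontier = 1  # start with just the root directory
    for children_per_dir in fanouts:
        # Listing one more level from the driver grows the frontier.
        frontier *= children_per_dir
        if frontier > threshold:
            # One job lists every directory in the frontier and everything
            # below it, so no further jobs are needed.
            return 1
    # The frontier never exceeded the threshold: list entirely on the driver.
    return 0


# 10 values of a, 20 values of b per a, 100 values of c per b.
print(count_jobs_batched([10, 20, 100], threshold=32))  # prints 1
```

With these inputs the frontier reaches 200 directories at level b, and one 
job lists all of them together with their children.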



