Gengliang Wang created SPARK-27627:
--------------------------------------

             Summary: Make option "pathGlobFilter" as a general option for all 
file sources
                 Key: SPARK-27627
                 URL: https://issues.apache.org/jira/browse/SPARK-27627
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Gengliang Wang


Background:
The data source option "pathGlobFilter" was introduced for the binary file 
format (https://github.com/apache/spark/pull/24354). It filters files by name, 
e.g. reading only "*.png" files when there are also "*.json" files in the same 
directory.
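
To illustrate the filtering described above, here is a minimal sketch in plain 
Python. It is not Spark code: the helper filter_by_glob and the sample paths are 
hypothetical, and Python's fnmatch is used as a stand-in for the glob matching 
Spark performs, which may differ in details.

```python
from fnmatch import fnmatchcase


def filter_by_glob(paths, pattern):
    """Keep only paths whose file name (last path component) matches the glob.

    Hypothetical helper for illustration; fnmatch semantics are an assumption
    and are not guaranteed to match the glob rules Spark applies.
    """
    return [p for p in paths if fnmatchcase(p.rsplit("/", 1)[-1], pattern)]


# A directory that mixes image files with JSON files (made-up sample listing).
listing = [
    "/data/images/cat.png",
    "/data/images/dog.png",
    "/data/images/labels.json",
    "/data/images/meta.json",
]

png_only = filter_by_glob(listing, "*.png")
print(png_only)
```

Only the two ".png" paths survive the filter; the ".json" files in the same 
directory are skipped.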

Proposal:
Make "pathGlobFilter" a general option for all file sources. The path filtering 
should happen during path globbing on the driver.

Motivation:
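Filtering file path names inside file scan tasks on executors is inelegant: 
files that will be discarded are still planned into file partitions and 
reflected in the scan metrics.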

Impact:
1. File partitions will be split more evenly.
2. File scan metrics will be more accurate.
3. Users can apply the option when reading other file sources.
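
The first two points can be seen in a toy model, sketched below in plain Python. 
Everything in it is an assumption made for illustration: the file names and 
sizes are invented, and plan_partitions is a simple greedy packer, not Spark's 
actual partition planning. It compares planning over all files and filtering 
later in the tasks against filtering first and planning only the matching files.

```python
from fnmatch import fnmatchcase


def plan_partitions(files, num_partitions):
    """Greedy packing: largest file first, into the currently lightest partition.

    A toy stand-in for partition planning, not Spark's algorithm.
    """
    partitions = [[] for _ in range(num_partitions)]
    loads = [0] * num_partitions
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        i = loads.index(min(loads))
        partitions[i].append((name, size))
        loads[i] += size
    return partitions


def useful_bytes(partition, pattern):
    """Bytes in a partition that belong to files matching the glob."""
    return sum(size for name, size in partition if fnmatchcase(name, pattern))


# Invented (name, size) pairs; sizes are in arbitrary units.
files = [("a.png", 40), ("b.png", 40), ("c.json", 80), ("d.json", 10), ("e.png", 10)]
pattern = "*.png"

# Filter inside the tasks: partitions are planned over every file, and the
# non-matching files are only discarded once a task runs.
late = [useful_bytes(p, pattern) for p in plan_partitions(files, 2)]
files_planned_late = len(files)

# Filter up front: only matching files reach partition planning.
matching = [f for f in files if fnmatchcase(f[0], pattern)]
early = [useful_bytes(p, pattern) for p in plan_partitions(matching, 2)]
files_planned_early = len(matching)

print("useful bytes per partition, filter in tasks:", late)    # [0, 90]
print("useful bytes per partition, filter up front:", early)   # [50, 40]
print("files planned:", files_planned_late, "vs", files_planned_early)  # 5 vs 3
```

In this model, filtering in the tasks leaves one partition with no useful work 
and the other with all of it, and the planner accounts for five files although 
only three are read. Filtering up front yields a 50/40 split over exactly the 
three matching files.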



