Gengliang Wang created SPARK-27627:
--------------------------------------
Summary: Make option "pathGlobFilter" as a general option for all
file sources
Key: SPARK-27627
URL: https://issues.apache.org/jira/browse/SPARK-27627
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang
Background:
The data source option "pathGlobFilter" was introduced for the binary file format in
https://github.com/apache/spark/pull/24354 . It filters files by name, e.g.
reading only "*.png" files when there are also "*.json" files in the same
directory.
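To illustrate the filtering semantics described above, here is a minimal sketch in plain Python. It is not Spark code: the helper `filter_by_glob` and the sample paths are hypothetical, and the standard-library `fnmatch` module stands in for whatever glob matcher Spark actually uses, so edge cases may differ.

```python
from fnmatch import fnmatchcase


def filter_by_glob(paths, pattern):
    """Keep only the paths whose file name (last component) matches the glob."""
    return [p for p in paths if fnmatchcase(p.rsplit("/", 1)[-1], pattern)]


# Hypothetical directory listing that mixes images and JSON files.
files = ["/data/a.png", "/data/b.json", "/data/c.png", "/data/notes.json"]

# Only the "*.png" files survive the filter; the JSON files are dropped.
png_only = filter_by_glob(files, "*.png")
print(png_only)
```

The pattern is matched against the file name only, not the full path, which mirrors the "filtering file names" wording above.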
Proposal:
Make "pathGlobFilter" a general option for all file sources. The path
filtering should happen during path globbing on the driver.
Motivation:
Filtering file paths inside file scan tasks on executors is inelegant: unwanted files are still listed and assigned to tasks before being skipped.
Impact:
1. The splitting of file partitions will be more balanced.
2. The metrics of file scan will be more accurate.
3. Users can use the option for reading other file sources.
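The first two impact points can be sketched with a toy simulation. This is an assumption-laden model, not Spark's actual partition planning: the greedy `pack` function, the file sizes, and the 100-byte partition limit are all hypothetical, chosen only to show how the useful work per task differs when filtering happens before or after files are split into partitions.

```python
from fnmatch import fnmatchcase


def pack(files, max_bytes):
    """Greedily pack (name, size) pairs, in order, into partitions of at most max_bytes."""
    partitions, current, used = [], [], 0
    for name, size in files:
        if current and used + size > max_bytes:
            partitions.append(current)
            current, used = [], 0
        current.append((name, size))
        used += size
    if current:
        partitions.append(current)
    return partitions


def useful_bytes(partition, pattern):
    """Bytes in a partition that belong to files matching the glob."""
    return sum(size for name, size in partition if fnmatchcase(name, pattern))


# Hypothetical listing: (file name, size in bytes).
files = [("a.png", 50), ("b.json", 50), ("c.png", 50),
         ("d.png", 50), ("e.json", 100), ("f.png", 50)]
pattern = "*.png"

# Executor-side filtering: every file is packed, then each task skips non-matches.
late = [useful_bytes(p, pattern) for p in pack(files, 100)]

# Driver-side filtering: non-matching files are dropped before packing.
kept = [f for f in files if fnmatchcase(f[0], pattern)]
early = [useful_bytes(p, pattern) for p in pack(kept, 100)]

print("executor-side:", late)   # one task ends up with no useful work
print("driver-side:  ", early)  # useful work is spread evenly
```

In this toy model, executor-side filtering plans four tasks covering 350 bytes although only 200 bytes are actually read, which is the kind of skew and metric inflation the list above refers to.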