Maxim Gekk created SPARK-32810:
----------------------------------
Summary: CSV/JSON data sources should avoid globbing paths when
inferring schema
Key: SPARK-32810
URL: https://issues.apache.org/jira/browse/SPARK-32810
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk
The problem is that when the user doesn't specify a schema when reading a CSV
table, the CSV file format and data source need to infer the schema, and they do
so by creating a base DataSource relation. There is a mismatch:
*FileFormat.inferSchema* expects actual file paths without glob patterns, but
*DataSource.paths* expects file paths as glob patterns.
An example is demonstrated below:
{code:java}
^
|   DataSource.resolveRelation    tries to glob again (incorrectly) on glob pattern """[abc].csv"""
|   DataSource.apply                  ^
|   CSVDataSource.inferSchema         |
|   CSVFileFormat.inferSchema         |
|   ...                               |
|   DataSource.resolveRelation    globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern
|   DataSource.apply                  ^
|   DataFrameReader.load              |
|                                 input """\[abc\].csv"""
{code}
The same problem exists in the JSON data source as well. Ditto for MLlib's
LibSVM data source.
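
The double-globbing effect can be illustrated outside Spark with a minimal sketch. It assumes the JDK's java.nio glob matcher as a stand-in for the Hadoop globbing that Spark actually uses, so exact escaping rules may differ; the class name DoubleGlobSketch and the helper globMatches are hypothetical and exist only for this illustration.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical illustration only: java.nio globbing stands in for Hadoop globbing.
class DoubleGlobSketch {

    // Returns true if the glob pattern matches the given file name.
    static boolean globMatches(String pattern, String fileName) {
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        return matcher.matches(Paths.get(fileName));
    }

    public static void main(String[] args) {
        // The user escapes the brackets to request the literal file "[abc].csv".
        String userInput = "\\[abc\\].csv";
        // The first globbing pass resolves that input to the real file name.
        String resolved = "[abc].csv";

        // First pass: the escaped pattern finds the intended file.
        System.out.println(globMatches(userInput, "[abc].csv")); // true

        // Second pass: the resolved name is wrongly treated as a pattern again,
        // so "[abc]" becomes a character class and the real file is missed.
        System.out.println(globMatches(resolved, "[abc].csv"));  // false
        System.out.println(globMatches(resolved, "a.csv"));      // true
    }
}
```

In this sketch the second pass no longer finds the file the user asked for and would instead pick up a.csv, b.csv, or c.csv if they existed, which is why an already-resolved path should be treated verbatim during schema inference.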