[
https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192684#comment-17192684
]
Apache Spark commented on SPARK-32810:
--------------------------------------
User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29690
> CSV/JSON data sources should avoid globbing paths when inferring schema
> -----------------------------------------------------------------------
>
> Key: SPARK-32810
> URL: https://issues.apache.org/jira/browse/SPARK-32810
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
> Reporter: Maxim Gekk
> Assignee: Maxim Gekk
> Priority: Major
> Fix For: 2.4.7, 3.1.0, 3.0.2
>
>
> The problem is that when the user doesn't specify the schema when reading a
> CSV table, the CSV file format and data source need to infer the schema. They
> do so by creating a base DataSource relation, and there is a mismatch:
> *FileFormat.inferSchema* expects actual file paths without glob patterns, but
> *DataSource.paths* expects file paths written as glob patterns.
> An example is demonstrated below:
> {code:java}
> ^
> | DataSource.resolveRelation   tries to glob again (incorrectly) on glob pattern """[abc].csv"""
> | DataSource.apply             ^
> | CSVDataSource.inferSchema    |
> | CSVFileFormat.inferSchema    |
> | ...                          |
> | DataSource.resolveRelation   globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern
> | DataSource.apply             ^
> | DataFrameReader.load         |
> |                              input """\[abc\].csv"""
> {code}
> The same problem exists in the JSON data source as well. Ditto for MLlib's
> LibSVM data source.
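
The double-globbing effect described above can be reproduced without Spark. The following is only a minimal sketch in plain Python, not Spark's code: the file list and the `expand` helper are hypothetical, and Python's `fnmatch` escapes a bracket as `[[]` rather than with the backslash used in the issue. It shows that one globbing pass resolves the escaped input to the literal file name, while a second pass over that result treats the name as a pattern and matches different files.

```python
import fnmatch
import glob

# Hypothetical directory listing; no real file system is touched.
FILES = ["[abc].csv", "a.csv", "b.csv", "c.csv"]


def expand(pattern, files=FILES):
    """Return the names in `files` that match `pattern` when read as a glob."""
    return [name for name in files if fnmatch.fnmatchcase(name, pattern)]


# The user escapes the brackets to name the literal file "[abc].csv".
# glob.escape produces Python's equivalent of the backslash-escaped input.
user_input = glob.escape("[abc].csv")

# First pass: the escaped pattern resolves to the literal file name.
first_pass = expand(user_input)

# Second pass: globbing the already-resolved name again (the reported mistake)
# reads "[abc]" as a character class and picks up the wrong files.
second_pass = [match for path in first_pass for match in expand(path)]

print(first_pass)
print(second_pass)
```

In this sketch the remedy is simply to stop after the first pass and treat its results as verbatim paths, which is the behavior the issue title asks for during schema inference.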
--
This message was sent by Atlassian Jira
(v8.3.4#803005)