[jira] [Created] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema

Maxim Gekk (Jira) Mon, 07 Sep 2020 00:16:23 -0700

Maxim Gekk created SPARK-32810:
----------------------------------

             Summary: CSV/JSON data sources should avoid globbing paths when inferring schema
                 Key: SPARK-32810
                 URL: https://issues.apache.org/jira/browse/SPARK-32810
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: Maxim Gekk



When the user does not specify a schema while reading a CSV table, the CSV 
file format and data source need to infer it. They do so by creating a base 
DataSource relation, and this is where a mismatch arises: 
*FileFormat.inferSchema* expects actual file paths without glob patterns, but 
*DataSource.paths* expects file paths as glob patterns. As a result, a path 
that has already been resolved is globbed a second time.
 An example is demonstrated below (read the call stack from bottom to top):
{code:java}
^
|         DataSource.resolveRelation    tries to glob again (incorrectly) on glob pattern """[abc].csv"""
|         DataSource.apply                      ^
|       CSVDataSource.inferSchema               |
|     CSVFileFormat.inferSchema                 |
|   ...                                         |
|   DataSource.resolveRelation          globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern
|   DataSource.apply                            ^
| DataFrameReader.load                          |
|                                       input """\[abc\].csv"""
{code}
The same problem exists in the JSON data source as well. Ditto for MLlib's 
LibSVM data source.
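
The double-globbing effect can be reproduced outside Spark. The following is 
only a minimal sketch using Python's standard library, not Spark or Hadoop 
code: the file listing and the resolve helper are hypothetical, and Python 
escapes a bracket as [[] where the input in this issue uses backslashes. It 
shows that a pattern which correctly resolves to the literal file name 
[abc].csv picks up different files when the result is globbed again.

```python
import fnmatch
import glob

# Hypothetical directory listing; "[abc].csv" is a real file name with brackets.
FILES = ["[abc].csv", "a.csv", "b.csv", "c.csv"]


def resolve(pattern, files=FILES):
    """Return the names in `files` that match `pattern` interpreted as a glob."""
    return [name for name in files if fnmatch.fnmatchcase(name, pattern)]


# The user escapes the brackets so that the pattern names exactly one file.
user_input = glob.escape("[abc].csv")  # "[[]abc].csv" in Python's glob syntax

# First pass (analogous to the outer resolveRelation): glob the user's input.
first_pass = resolve(user_input)

# Second pass (the bug): the resolved path is globbed again, so "[abc]" is now
# read as a character class instead of literal text.
second_pass = [match for path in first_pass for match in resolve(path)]

# Intended behaviour: resolved paths are taken verbatim, with no second glob.
verbatim = [path for path in first_pass if path in FILES]

print(first_pass)   # ['[abc].csv']
print(second_pass)  # ['a.csv', 'b.csv', 'c.csv']
print(verbatim)     # ['[abc].csv']
```

In this sketch the second pass silently infers from the wrong files; had no 
a.csv, b.csv or c.csv existed, it would have found nothing at all.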



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

