GitHub user MaxGekk opened a pull request:
https://github.com/apache/spark/pull/22656
[SPARK-25669][SQL] Check CSV header only when it exists
## What changes were proposed in this pull request?
Currently, the first row of a dataset of CSV strings is compared to the field
names of the user-specified or inferred schema regardless of whether a CSV
header is present. This causes false-positive error messages. For example,
parsing `"1,2"` outputs the error:
```java
java.lang.IllegalArgumentException: CSV header does not conform to the schema.
Header: 1, 2
Schema: _c0, _c1
Expected: _c0 but found: 1
```
In this PR, I propose:
- Checking the CSV header only when it exists
- Filtering the header from the input dataset only if it exists
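
The intended behavior can be illustrated with a minimal sketch in plain Python. This is not Spark's implementation: `check_header`, `read_csv_lines`, and the `has_header` flag are hypothetical names used only for illustration, and the naive comma split stands in for a real CSV parser.

```python
from typing import List


def check_header(header_fields: List[str], schema_fields: List[str]) -> None:
    """Raise ValueError if header names differ from schema names by position."""
    if len(header_fields) != len(schema_fields):
        raise ValueError("CSV header does not conform to the schema.")
    for found, expected in zip(header_fields, schema_fields):
        if found != expected:
            raise ValueError(
                "CSV header does not conform to the schema. "
                f"Expected: {expected} but found: {found}"
            )


def read_csv_lines(
    lines: List[str], schema_fields: List[str], has_header: bool
) -> List[List[str]]:
    """Parse lines into rows (hypothetical helper, naive comma split).

    The first line is validated against the schema and dropped only
    when the caller says a header exists; otherwise it is kept as data.
    """
    if has_header and lines:
        check_header(lines[0].split(","), schema_fields)
        lines = lines[1:]
    return [line.split(",") for line in lines]
```

With `has_header=False`, the input `["1,2"]` is returned as a single data row and no header comparison takes place, which is the case that previously produced the false-positive error.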
## How was this patch tested?
Added a test to `CSVSuite` which reproduces the issue.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MaxGekk/spark-1 inferred-header-check
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22656.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22656
----
commit 676e5580e5f01c7800c734634911b65a91531d4b
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-10-06T12:33:28Z
Don't need to check inferred field names to the first row
----