GitHub user MaxGekk opened a pull request:
https://github.com/apache/spark/pull/20894
[SPARK-23786][SQL] Checking column names of csv headers
## What changes were proposed in this pull request?
Currently, the column names in the header of a CSV file are not checked against
the provided schema of the CSV data. This can cause errors like the one shown in
[SPARK-23786](https://issues.apache.org/jira/browse/SPARK-23786). I introduced a
new CSV option, `checkHeader` (`true` by default), which enables checking of
column names against the schema's fields. The check is performed while
processing the first partition of the CSV files. If the names do not match, the
following exception is thrown:
```
java.lang.IllegalArgumentException: Fields in the header of csv file are
not matched to field names of the schema:
Header: depth, temperature
Schema: temperature, depth
```
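For illustration only, the kind of comparison described above can be sketched in plain Python. This is not the code from the patch: the function name `check_header` and the use of `ValueError` are assumptions made for the example, and the sketch compares names positionally and case-sensitively, which may differ from what the patch does.

```python
from typing import Sequence


def check_header(header: Sequence[str], schema_fields: Sequence[str]) -> None:
    """Hypothetical helper: fail if header names differ from schema field names.

    Names are compared position by position, so the same set of names in a
    different order is treated as a mismatch.
    """
    if list(header) != list(schema_fields):
        raise ValueError(
            "Fields in the header of csv file are not matched to field names "
            "of the schema:\n"
            f" Header: {', '.join(header)}\n"
            f" Schema: {', '.join(schema_fields)}"
        )


# A header that matches the schema passes silently.
check_header(["depth", "temperature"], ["depth", "temperature"])

# A header with swapped columns is rejected.
try:
    check_header(["depth", "temperature"], ["temperature", "depth"])
except ValueError as err:
    print(err)
```

The positional comparison is what catches the swapped-columns case from the exception message above, where both names are present but in the wrong order.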
## How was this patch tested?
The changes were tested by the existing tests in CSVSuite and by two new tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MaxGekk/spark-1 check-column-names
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20894.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20894
----
commit 112ce2d34d0d039711777351c1ab8e74629fc8e6
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-03-20T15:30:44Z
Checks column names are compatible to provided schema
commit a85ccce23c3c5ee69ff321303ad830c71dd05931
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-03-20T20:51:03Z
Checking header is matched to schema in per-line mode
commit 75e15345b6a5a9e807375fdf465dccfce4ea62c7
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-03-20T21:36:56Z
Extract header and check that it is matched to schema
commit 8eb45b8b634ba2c9b641de12e09f17c63240ccc4
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-03-21T10:57:30Z
Checking column names in header in multiLine mode
commit 9b1a9862531b8d3fb3cffce75126413ca9a844b9
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-03-21T11:13:17Z
Adding the checkHeader option with true by default
commit 64426332b2ab42a1cd9c5a05a77e90332572bbec
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-03-21T11:25:31Z
Fix csv test by changing headers or disabling header checking
commit 9440d8a5c097a1d8e111b397fbda9e54751b7a84
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-03-21T11:36:21Z
Adding comment for the checkHeader option
commit 9f91ce73c5c313a9c51067a81e395e9385016ec5
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-03-21T11:42:48Z
Added comments
commit 0878f7aad3c074e63ac3ab1d6e471ce8b988f278
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-03-21T12:09:20Z
Adding a space between column names
commit a341dd79c976df59fc8bffb272449973a09b86fe
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-03-21T15:15:14Z
Fix a test: checking name duplication in schemas
commit 98c27eaa80cf3fae11092d78f22122688e4041a4
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-03-23T21:04:57Z
Fixing the test and adding ticket number to test's title
commit 811df6fa7b17ff12bdd70318cf330a0f54815397
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-03-23T21:10:20Z
Refactoring - removing unneeded parameter
----