GitHub user MaxGekk opened a pull request:
https://github.com/apache/spark/pull/23091
[SPARK-26122][SQL] Support encoding for multiLine in CSV datasource
## What changes were proposed in this pull request?
In this PR, I propose passing the CSV option `encoding`/`charset` to the
`uniVocity` parser so that CSV files in different encodings can be parsed when
`multiLine` is enabled. The value of the option is passed to the `beginParsing`
method of `CSVParser`.
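
For illustration only, here is a minimal sketch of how the options would be combined from the user side, assuming a Spark build that includes this change. The object name, the helper `readMultiLineCsv`, and the sample data are hypothetical and are not part of the patch.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}

object MultiLineEncodingExample {
  // Hypothetical helper: reads a CSV file with a header in multiLine mode,
  // using the given charset name for decoding.
  def readMultiLineCsv(spark: SparkSession, path: String, encoding: String): DataFrame =
    spark.read
      .option("header", true)
      .option("multiLine", true)
      .option("encoding", encoding)
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("csv-multiline-encoding")
      .getOrCreate()

    // Made-up sample data: a header plus one record whose quoted second
    // field contains a line break, written to a temp file as UTF-16LE.
    val csv = "id,comment\n1,\"first line\nsecond line\"\n"
    val path = Files.createTempFile("sample", ".csv")
    Files.write(path, csv.getBytes(StandardCharsets.UTF_16LE))

    val df = readMultiLineCsv(spark, path.toString, "UTF-16LE")
    df.show(truncate = false)

    spark.stop()
  }
}
```

Per the description above, without this change the `encoding` value would not reach the parser in `multiLine` mode, so a non-UTF-8 file like this one may not be decoded as intended.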
## How was this patch tested?
Added a new test to `CSVSuite` that covers different encodings, both with the
header enabled and with it disabled.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MaxGekk/spark-1 csv-miltiline-encoding
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/23091.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #23091
----
commit 1a7a0cb4430f847ac95c0c764393003581415103
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-11-19T20:51:04Z
Added a test
commit cd57ec5833bbfb5f0b33d63a56b48d25924f6be1
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-11-19T21:07:41Z
Test multiple encodings
commit 1c76f8944979df8a7b9b8181ebfa38933c3f2c00
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-11-19T21:09:04Z
Pass encoding to uniVocity parser
commit 16eb14c73f3fad8d83fee41d5665b52f180daf73
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-11-19T21:22:23Z
Test with header and without it
----