GitHub user MaxGekk opened a pull request:
https://github.com/apache/spark/pull/21415
[SPARK-24244][SPARK-24368][SQL] Passing only required columns to the CSV parser
## What changes were proposed in this pull request?
The uniVocity parser allows selecting only the required column names or indexes
for [parsing](https://www.univocity.com/pages/parsers-tutorial), like:
```
// Here we select only the columns by their indexes.
// The parser just skips the values in other columns
parserSettings.selectIndexes(4, 0, 1);
CsvParser parser = new CsvParser(parserSettings);
```
In this PR, I propose to extract the indexes of the required columns from the
required schema and pass them to the CSV parser (see the sketch after the
benchmark numbers below). Benchmarks show the following speed-ups when parsing
1000 columns:
```
Select 100 columns out of 1000: x1.76
Select 1 column out of 1000: x2
```
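As a rough illustration of the idea (this is not the PR's actual code; the schemas,
column names and values below are made up), the positions of the required columns
can be computed against the full data schema and handed to uniVocity, which then
skips tokenizing all other columns:
```
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
import org.apache.spark.sql.types._

// Full schema of the CSV file (hypothetical).
val dataSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("age", IntegerType),
  StructField("city", StringType)))

// Only these columns are requested by the query (hypothetical).
val requiredSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("city", StringType)))

// Positions of the required columns within the full schema: Seq(1, 3).
val tokenIndexes = requiredSchema.map(f => dataSchema.fieldIndex(f.name))

val settings = new CsvParserSettings()
// The parser just skips the values in the non-selected columns.
settings.selectIndexes(tokenIndexes.map(i => Integer.valueOf(i)): _*)
val parser = new CsvParser(settings)
```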
**Note**: Compared to the current implementation, the changes can return
different results for malformed rows in the `DROPMALFORMED` and `FAILFAST` modes
if only a subset of all columns is requested. To restore the previous behavior, set
`spark.sql.csv.parser.columnPruning.enabled` to `false`.
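A hedged example of turning the flag off (assumes an existing SparkSession named
`spark`; the schema and path here are hypothetical):
```
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "false")

val df = spark.read
  .option("mode", "DROPMALFORMED")
  .schema("id INT, name STRING, score DOUBLE")  // full schema of the CSV file
  .csv("/path/to/data.csv")
  .select("name")                               // only a subset of the columns
```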
## How was this patch tested?
It was tested by a new test that selects 3 columns out of 15, by existing tests,
and by new benchmarks.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MaxGekk/spark-1 csv-column-pruning2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21415.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21415
----
commit 9cffa0fccc33552e8fce3580a9a665b022f5bf22
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-03-21T20:03:11Z
Adding tests for select only requested columns
commit fdbcbe3536aee04e6a84b72ac319726614416bc3
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-03-21T20:42:08Z
Select indexes of required columns only
commit 578f47b0f32a76caf6c9ede8763c9cf85a1c83e9
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-03-24T10:41:29Z
Fix the case when number of parsed fields are not matched to required schema
commit 0f942c308dca173dad8f421e893066b8c03d35a3
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-03-24T11:07:55Z
Using selectIndexes if required number of columns are less than its total
number.
commit c4b11601e9c264729e141fff3dc653d868a7ad69
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-03-24T11:48:43Z
Fix the test: force to read all columns
commit 8cf6eab952d79628cb8ee2ff7b92dadae60ec686
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-04-06T20:55:35Z
Fix merging conflicts
commit 5b2f0b9d7346f927842bc1a2089a7299876f1894
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-04-29T11:52:08Z
Benchmarks for many columns
commit 6d1e902c0011e88dbafb65c4ad6e7431370ed12d
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-04-29T12:59:58Z
Make size of requiredSchema equals to amount of selected columns
commit 4525795f7337cbd081f569cd79d7f90cb58edbee
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-04-29T13:36:54Z
Removing selection of all columns
commit 8809cecf93d8e7a97eca827d9e8637a7eb5b2449
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-04-29T13:50:44Z
Updating benchmarks for select indexes
commit dc97ceb96185ed2eaa05fbe1aee8ecfe8ccb7e7d
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-05-05T19:19:17Z
Addressing Herman's review comments
commit 51b31483263e13cd85b19b3efea65188945eda99
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-05-10T18:39:38Z
Updated benchmark result for recent changes
commit e3958b1468b490b548574b53512f0d83850e6f6f
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-05-10T18:46:17Z
Add ticket number to test title
commit a4a0a549156a15011c33c7877a35f244d75b7a4f
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-05-10T19:02:24Z
Removing unnecessary benchmark
commit fa860157c982846524bd8f151daf8a2154117b34
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-05-13T18:49:49Z
Updating the migration guide
commit 15528d20a74904c14c58bf3ad54c9a552c519430
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-05-13T18:55:06Z
Moving some values back as it was.
commit f90daa7ea33d119be978c27de10978c2d6281e25
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-05-13T18:58:20Z
Renaming the test title
commit 4d9873d39277b9cbaee892957c06bfc2cb9a52f1
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-05-17T20:02:47Z
Improving of the migration guide
commit 7dcfc7a7664fcd5311cb352f0ea7a24b3cc1c639
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-05-17T20:12:49Z
Merge remote-tracking branch 'origin/master' into csv-column-pruning
# Conflicts:
#    sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVBenchmarks.scala
#    sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
commit f89eeb7f7ba86888ad3f7994577a4d4ebbf09197
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-05-17T20:39:10Z
Fix example
commit 6ff6d4fda9f7e8ee43d7aa04818204de4c49440b
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-05-18T15:25:50Z
Adding spark.sql.csv.parser.columnPruning.enabled
commit 0aef16b5e9017fb398e0df2f3694a1db1f4d7cb8
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-05-23T19:31:14Z
Add columnPruning as a parameter for CSVOptions
----