GitHub user MaxGekk opened a pull request:
https://github.com/apache/spark/pull/21296
[SPARK-24244][SQL] CSV column pruning
## What changes were proposed in this pull request?
The uniVocity parser allows selecting only the required columns, by name or
by index, for [parsing](https://www.univocity.com/pages/parsers-tutorial), for example:
```
CsvParserSettings parserSettings = new CsvParserSettings();
// Here we select only the columns by their indexes.
// The parser just skips the values in other columns
parserSettings.selectIndexes(4, 0, 1);
CsvParser parser = new CsvParser(parserSettings);
```
In this PR, I propose extracting the indexes of the requested columns from the
required schema and passing them to the CSV parser. Benchmarks show the
following speedups when parsing data with 1000 columns:
```
Select 100 columns out of 1000: x1.76
Select 1 column out of 1000: x2
```
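The index extraction described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the class `ColumnPruningSketch`, the method `requiredIndexes`, and the representation of a schema as a list of column names are all assumptions made for the example.

```java
import java.util.List;

// Hypothetical helper, not part of Spark or uniVocity.
final class ColumnPruningSketch {
    // Maps each column of the required schema to its position in the full
    // data schema, preserving the order of the required schema.
    static Integer[] requiredIndexes(List<String> dataSchema, List<String> requiredSchema) {
        Integer[] indexes = new Integer[requiredSchema.size()];
        for (int i = 0; i < indexes.length; i++) {
            String name = requiredSchema.get(i);
            int position = dataSchema.indexOf(name);
            if (position < 0) {
                throw new IllegalArgumentException("Unknown column: " + name);
            }
            indexes[i] = position;
        }
        return indexes;
    }
}
```

With a data schema of `a, b, c, d, e` and a required schema of `e, a, b`, the helper yields the indexes 4, 0, 1, which is the kind of argument list that `parserSettings.selectIndexes(...)` takes in the tutorial snippet.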
**Note**: Compared to the current implementation, these changes can return
different results for malformed rows if only a subset of the columns is requested.
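One way to picture why the results can differ is the toy model below. It is a simplified assumption about the effect, not Spark's actual malformed-row handling: the class `MalformedRowSketch` and the method `parsesCleanly` are hypothetical, and every column is assumed to be an integer.

```java
// Hypothetical toy model, not Spark code.
final class MalformedRowSketch {
    // Returns true if every selected field of a comma-separated line
    // is present and parses as an int; unselected fields are never inspected.
    static boolean parsesCleanly(String line, int[] selected) {
        String[] tokens = line.split(",", -1);
        for (int index : selected) {
            if (index >= tokens.length) {
                return false; // the requested field is missing
            }
            try {
                Integer.parseInt(tokens[index].trim());
            } catch (NumberFormatException e) {
                return false; // the requested field has the wrong type
            }
        }
        return true;
    }
}
```

In this model, the line `1,abc,3` fails when columns 0, 1, and 2 are all requested, but passes when only columns 0 and 2 are requested, because the bad value in column 1 is skipped.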
## How was this patch tested?
It was tested by a new test that selects 3 columns out of 15, by the existing
tests, and by new benchmarks.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MaxGekk/spark-1 csv-column-pruning
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21296.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21296
----
commit 9cffa0fccc33552e8fce3580a9a665b022f5bf22
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-03-21T20:03:11Z
Adding tests for select only requested columns
commit fdbcbe3536aee04e6a84b72ac319726614416bc3
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-03-21T20:42:08Z
Select indexes of required columns only
commit 578f47b0f32a76caf6c9ede8763c9cf85a1c83e9
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-03-24T10:41:29Z
Fix the case when number of parsed fields are not matched to required schema
commit 0f942c308dca173dad8f421e893066b8c03d35a3
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-03-24T11:07:55Z
Using selectIndexes if required number of columns are less than its total
number.
commit c4b11601e9c264729e141fff3dc653d868a7ad69
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-03-24T11:48:43Z
Fix the test: force to read all columns
commit 8cf6eab952d79628cb8ee2ff7b92dadae60ec686
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-04-06T20:55:35Z
Fix merging conflicts
commit 5b2f0b9d7346f927842bc1a2089a7299876f1894
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-04-29T11:52:08Z
Benchmarks for many columns
commit 6d1e902c0011e88dbafb65c4ad6e7431370ed12d
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-04-29T12:59:58Z
Make size of requiredSchema equals to amount of selected columns
commit 4525795f7337cbd081f569cd79d7f90cb58edbee
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-04-29T13:36:54Z
Removing selection of all columns
commit 8809cecf93d8e7a97eca827d9e8637a7eb5b2449
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-04-29T13:50:44Z
Updating benchmarks for select indexes
commit dc97ceb96185ed2eaa05fbe1aee8ecfe8ccb7e7d
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-05-05T19:19:17Z
Addressing Herman's review comments
commit 51b31483263e13cd85b19b3efea65188945eda99
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-05-10T18:39:38Z
Updated benchmark result for recent changes
commit e3958b1468b490b548574b53512f0d83850e6f6f
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-05-10T18:46:17Z
Add ticket number to test title
----