GitHub user MaxGekk opened a pull request:

    https://github.com/apache/spark/pull/21296

    [SPARK-24244][SQL] CSV column pruning

    ## What changes were proposed in this pull request?
    
    The uniVocity parser lets callers select only the required columns, by name
    or by index, for [parsing](https://www.univocity.com/pages/parsers-tutorial), for example:
    
    ```
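    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;

    CsvParserSettings parserSettings = new CsvParserSettings();
    // Here we select only the columns by their indexes.
    // The parser just skips the values in other columns
    parserSettings.selectIndexes(4, 0, 1);
    CsvParser parser = new CsvParser(parserSettings);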
    ```
    In this PR, I propose extracting the indexes of the required columns from the
    required schema and passing them to the CSV parser. Benchmarks show the
    following speedups when parsing 1000 columns:
    
    ```
    Select 100 columns out of 1000: x1.76
    Select 1 column out of 1000: x2
    ```
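    
    The index extraction described above could be sketched roughly as follows.
    This is an illustration only, not the code in this PR: it assumes both
    schemas are plain lists of column names, and `ColumnPruningSketch` and
    `requiredIndexes` are hypothetical names. The only uniVocity call referenced
    is `selectIndexes`, and it appears in a comment so the sketch runs without
    the library.
    
    ```java
    import java.util.Arrays;
    import java.util.List;

    class ColumnPruningSketch {
        // For each required column, find its position in the full schema,
        // preserving the order in which the required schema lists the columns.
        static Integer[] requiredIndexes(List<String> fullSchema, List<String> requiredSchema) {
            Integer[] indexes = new Integer[requiredSchema.size()];
            for (int i = 0; i < indexes.length; i++) {
                String name = requiredSchema.get(i);
                int pos = fullSchema.indexOf(name);
                if (pos < 0) {
                    throw new IllegalArgumentException("Unknown column: " + name);
                }
                indexes[i] = pos;
            }
            return indexes;
        }

        public static void main(String[] args) {
            // Hypothetical five-column file from which three columns are requested.
            List<String> fullSchema = Arrays.asList("a", "b", "c", "d", "e");
            List<String> requiredSchema = Arrays.asList("e", "a", "b");
            Integer[] indexes = requiredIndexes(fullSchema, requiredSchema);
            System.out.println(Arrays.toString(indexes)); // prints [4, 0, 1]
            // With uniVocity on the classpath, the array could then be handed over:
            // parserSettings.selectIndexes(indexes);
        }
    }
    ```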
    
    **Note**: Compared to the current implementation, these changes can return
    different results for malformed rows if only a subset of the columns is requested.
     
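    One way such a difference could arise is shown in the toy model below. It is
    not Spark's or uniVocity's actual code, and `MalformedRowSketch` and
    `parseSelected` are hypothetical names. It assumes every column is an
    integer and that a row counts as malformed only when a value that is
    actually converted fails to parse. Under that assumption, a bad value in a
    column that was not requested goes unnoticed.
    
    ```java
    import java.util.Arrays;

    class MalformedRowSketch {
        // Convert only the tokens at the given indexes to integers. Returns null
        // when one of those tokens is missing or is not a valid integer, which
        // this toy model treats as a malformed row.
        static int[] parseSelected(String line, int[] indexes) {
            String[] tokens = line.split(",", -1);
            int[] values = new int[indexes.length];
            for (int i = 0; i < indexes.length; i++) {
                try {
                    values[i] = Integer.parseInt(tokens[indexes[i]].trim());
                } catch (NumberFormatException | ArrayIndexOutOfBoundsException e) {
                    return null;
                }
            }
            return values;
        }

        public static void main(String[] args) {
            String row = "1,abc,3"; // the second column holds a non-integer value

            // Reading all columns hits the bad value, so the row is malformed.
            System.out.println(Arrays.toString(parseSelected(row, new int[]{0, 1, 2}))); // prints null

            // Reading only the first and third columns never looks at "abc".
            System.out.println(Arrays.toString(parseSelected(row, new int[]{0, 2}))); // prints [1, 3]
        }
    }
    ```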
    ## How was this patch tested?
    
    It was tested by a new test that selects 3 columns out of 15, by the existing
    tests, and by new benchmarks.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 csv-column-pruning

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21296.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21296
    
----
commit 9cffa0fccc33552e8fce3580a9a665b022f5bf22
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-21T20:03:11Z

    Adding tests for select only requested columns

commit fdbcbe3536aee04e6a84b72ac319726614416bc3
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-21T20:42:08Z

    Select indexes of required columns only

commit 578f47b0f32a76caf6c9ede8763c9cf85a1c83e9
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-24T10:41:29Z

    Fix the case when number of parsed fields are not matched to required schema

commit 0f942c308dca173dad8f421e893066b8c03d35a3
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-24T11:07:55Z

    Using selectIndexes if required number of columns are less than its total 
number.

commit c4b11601e9c264729e141fff3dc653d868a7ad69
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-24T11:48:43Z

    Fix the test: force to read all columns

commit 8cf6eab952d79628cb8ee2ff7b92dadae60ec686
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-04-06T20:55:35Z

    Fix merging conflicts

commit 5b2f0b9d7346f927842bc1a2089a7299876f1894
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-04-29T11:52:08Z

    Benchmarks for many columns

commit 6d1e902c0011e88dbafb65c4ad6e7431370ed12d
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-04-29T12:59:58Z

    Make size of requiredSchema equals to amount of selected columns

commit 4525795f7337cbd081f569cd79d7f90cb58edbee
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-04-29T13:36:54Z

    Removing selection of all columns

commit 8809cecf93d8e7a97eca827d9e8637a7eb5b2449
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-04-29T13:50:44Z

    Updating benchmarks for select indexes

commit dc97ceb96185ed2eaa05fbe1aee8ecfe8ccb7e7d
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-05-05T19:19:17Z

    Addressing Herman's review comments

commit 51b31483263e13cd85b19b3efea65188945eda99
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-05-10T18:39:38Z

    Updated benchmark result for recent changes

commit e3958b1468b490b548574b53512f0d83850e6f6f
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-05-10T18:46:17Z

    Add ticket number to test title

----

