GitHub user MaxGekk opened a pull request:

    https://github.com/apache/spark/pull/21415

    [SPARK-24244][SPARK-24368][SQL] Passing only required columns to the CSV parser

    ## What changes were proposed in this pull request?
    
    The uniVocity parser allows specifying only the required column names or indexes for [parsing](https://www.univocity.com/pages/parsers-tutorial), for example:
    
    ```
    // Here we select only the columns by their indexes.
    // The parser just skips the values in other columns
    parserSettings.selectIndexes(4, 0, 1);
    CsvParser parser = new CsvParser(parserSettings);
    ```
    In this PR, I propose to extract the indexes of the required columns from the required schema and pass them to the CSV parser. Benchmarks show the following speedups when parsing 1000 columns:
    
    ```
    Select 100 columns out of 1000: x1.76
    Select 1 column out of 1000: x2
    ```
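    The index extraction described above can be sketched as follows. This is an illustrative sketch only, not code from the PR; the class and method names (`ColumnPruningSketch`, `requiredIndexes`) are hypothetical:

    ```java
    import java.util.Arrays;
    import java.util.List;

    public class ColumnPruningSketch {
        // Map each required column name to its index in the full CSV header.
        // The resulting indexes would then be handed to the parser, as in
        // parserSettings.selectIndexes(...) in the snippet above.
        static int[] requiredIndexes(List<String> fullHeader, List<String> requiredColumns) {
            return requiredColumns.stream()
                    .mapToInt(fullHeader::indexOf)
                    .toArray();
        }

        public static void main(String[] args) {
            List<String> header = Arrays.asList("id", "name", "age", "city", "zip");
            List<String> required = Arrays.asList("zip", "id", "name");
            // Prints [4, 0, 1] -- the same indexes as in the uniVocity example.
            System.out.println(Arrays.toString(requiredIndexes(header, required)));
        }
    }
    ```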
    
    **Note**: Compared to the current implementation, the change can return different results for malformed rows in the `DROPMALFORMED` and `FAILFAST` modes if only a subset of all columns is requested. To restore the previous behavior, set `spark.sql.csv.parser.columnPruning.enabled` to `false`.
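    The fallback mentioned in the note is a config fragment; a minimal sketch, assuming a running `SparkSession` named `spark`:

    ```
    // Restore the pre-patch behavior for malformed rows in
    // DROPMALFORMED and FAILFAST modes:
    spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "false")
    ```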
     
    ## How was this patch tested?
    
    It was tested by a new test that selects 3 columns out of 15, by existing tests, and by new benchmarks.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 csv-column-pruning2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21415.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21415
    
----
commit 9cffa0fccc33552e8fce3580a9a665b022f5bf22
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-21T20:03:11Z

    Adding tests for select only requested columns

commit fdbcbe3536aee04e6a84b72ac319726614416bc3
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-21T20:42:08Z

    Select indexes of required columns only

commit 578f47b0f32a76caf6c9ede8763c9cf85a1c83e9
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-24T10:41:29Z

    Fix the case when number of parsed fields are not matched to required schema

commit 0f942c308dca173dad8f421e893066b8c03d35a3
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-24T11:07:55Z

    Using selectIndexes if required number of columns are less than its total 
number.

commit c4b11601e9c264729e141fff3dc653d868a7ad69
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-03-24T11:48:43Z

    Fix the test: force to read all columns

commit 8cf6eab952d79628cb8ee2ff7b92dadae60ec686
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-04-06T20:55:35Z

    Fix merging conflicts

commit 5b2f0b9d7346f927842bc1a2089a7299876f1894
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-04-29T11:52:08Z

    Benchmarks for many columns

commit 6d1e902c0011e88dbafb65c4ad6e7431370ed12d
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-04-29T12:59:58Z

    Make size of requiredSchema equals to amount of selected columns

commit 4525795f7337cbd081f569cd79d7f90cb58edbee
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-04-29T13:36:54Z

    Removing selection of all columns

commit 8809cecf93d8e7a97eca827d9e8637a7eb5b2449
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-04-29T13:50:44Z

    Updating benchmarks for select indexes

commit dc97ceb96185ed2eaa05fbe1aee8ecfe8ccb7e7d
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-05-05T19:19:17Z

    Addressing Herman's review comments

commit 51b31483263e13cd85b19b3efea65188945eda99
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-05-10T18:39:38Z

    Updated benchmark result for recent changes

commit e3958b1468b490b548574b53512f0d83850e6f6f
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-05-10T18:46:17Z

    Add ticket number to test title

commit a4a0a549156a15011c33c7877a35f244d75b7a4f
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-05-10T19:02:24Z

    Removing unnecessary benchmark

commit fa860157c982846524bd8f151daf8a2154117b34
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-05-13T18:49:49Z

    Updating the migration guide

commit 15528d20a74904c14c58bf3ad54c9a552c519430
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-05-13T18:55:06Z

    Moving some values back as it was.

commit f90daa7ea33d119be978c27de10978c2d6281e25
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-05-13T18:58:20Z

    Renaming the test title

commit 4d9873d39277b9cbaee892957c06bfc2cb9a52f1
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-05-17T20:02:47Z

    Improving of the migration guide

commit 7dcfc7a7664fcd5311cb352f0ea7a24b3cc1c639
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-05-17T20:12:49Z

    Merge remote-tracking branch 'origin/master' into csv-column-pruning
    
    # Conflicts:
    #   
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVBenchmarks.scala
    #   
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala

commit f89eeb7f7ba86888ad3f7994577a4d4ebbf09197
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-05-17T20:39:10Z

    Fix example

commit 6ff6d4fda9f7e8ee43d7aa04818204de4c49440b
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-05-18T15:25:50Z

    Adding spark.sql.csv.parser.columnPruning.enabled

commit 0aef16b5e9017fb398e0df2f3694a1db1f4d7cb8
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-05-23T19:31:14Z

    Add columnPruning as a parameter for CSVOptions

----

