GitHub user MaxGekk opened a pull request:
https://github.com/apache/spark/pull/21909
[SPARK-24959][SQL] Speed up count() for JSON and CSV
## What changes were proposed in this pull request?
In this PR, I propose to skip invoking the CSV/JSON parser for each line
when the required schema is empty. The added benchmarks for `count()`
show a performance improvement of up to **3.5 times**.
Before:
```
Count a dataset with 10 columns:   Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)
------------------------------------------------------------------------------
JSON count()                             7676 / 7715         1.3         767.6
CSV count()                              3309 / 3363         3.0         330.9
```
After:
```
Count a dataset with 10 columns:   Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)
------------------------------------------------------------------------------
JSON count()                             2104 / 2156         4.8         210.4
CSV count()                              2332 / 2386         4.3         233.2
```
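The idea of skipping the parser when the required schema is empty can be illustrated outside Spark. The following is a minimal Python sketch, not the code from this PR: the function name `count_rows` and the parameter `required_fields` are hypothetical, and it assumes one JSON record per line. When no fields are required, each non-blank line is counted as a row without being parsed; otherwise every line goes through the parser.

```python
import json
from typing import Iterable, Sequence


def count_rows(lines: Iterable[str], required_fields: Sequence[str]) -> int:
    """Count records, invoking the JSON parser only if some field is required.

    Hypothetical helper for illustration; it is not part of Spark.
    """
    if not required_fields:
        # Empty required schema: no column values are needed, so each
        # non-blank line counts as one row and the parser is never called.
        return sum(1 for line in lines if line.strip())

    count = 0
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)  # full parse, needed to extract values
        # Extract the requested fields (the values are discarded here
        # because this sketch only returns the row count).
        _values = [record.get(name) for name in required_fields]
        count += 1
    return count


if __name__ == "__main__":
    sample = ['{"a": 1}', "", '{"a": 2, "b": 3}']
    print(count_rows(sample, []))     # fast path, no parsing
    print(count_rows(sample, ["a"]))  # slow path, parses each line
```

One consequence of the fast path in this sketch is that lines are counted without being validated, so a malformed line still counts as a row when no fields are required.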
## How was this patch tested?
It was tested by `CSVSuite` and `JSONSuite`, as well as by the added benchmarks.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MaxGekk/spark-1 empty-schema-optimization
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21909.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21909
----
commit bc4ce261a2d13be0a31b18f006da79b55880d409
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-07-28T15:31:20Z
Added a benchmark for count()
commit 91250d21d4bb451062873c59df6fe3b4669bc5ff
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-07-28T15:50:15Z
Added a CSV benchmark for count()
commit bdc5ea540b9eb62bb28606bdeb311ce5662e4bf7
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-07-28T15:59:44Z
Speed up count()
commit d40f9bb229ab8ea9e2d95499ae203f7c41098bcd
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-07-28T16:00:17Z
Updating CSV and JSON benchmarks for count()
commit abd8572497ff742ef6ea942864195be75a40ca71
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-07-28T16:23:03Z
Fix benchmark's output
commit 359c4fcbfdb4f4e77faa3977f381dc8e819e46fa
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-07-28T16:23:44Z
Uncomment other benchmarks
----