parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1129363998
I have some numbers from an internal benchmark using Spark. I didn't see any
benchmarks in the Parquet codebase that I could reuse, so here are the numbers
from my own benchmark:
- 10 runs; each run reads all columns from store_sales (the largest table)
  in the TPC-DS (100 GB) dataset: `spark.sql("select * from store_sales")`
- Sync reader uses the default 8 MB buffer size; the async reader uses a 1 MB
  buffer size (which achieves better pipelining)
- Run on a MacBook Pro, reading from S3; Spark has 6 cores
- All times are in seconds
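The comment doesn't show the harness itself; a minimal sketch of how the per-run wall-clock times could be collected (the helper name is mine, and the Spark action in the usage note is illustrative, not from the PR):

```python
import time

def time_runs(read_fn, runs=10):
    """Execute read_fn `runs` times and return the wall-clock seconds per run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        read_fn()  # the workload under test, e.g. a full-table Spark read
        timings.append(time.perf_counter() - start)
    return timings
```

Illustrative usage with Spark would be something like `time_runs(lambda: spark.sql("select * from store_sales").foreach(lambda _: None))`, forcing the full scan on each run.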
| Run | Async | Sync | Async (w/o outliers)| Sync (w/o outliers) |
| ---:| ---:| ---:| ---:| ---:|
|1| 84| 102| - | - |
|2| 90| 366| 90| 366|
|3| 78| 156| - | 156|
|4| 84| 128| 84| - |
|5| 108|402| - | - |
|6| 90| 432| 90| - |
|7| 84| 378| 84| 378|
|8| 108|324| - | 324|
|9| 90| 318| 90| 318|
|10|90| 282| 90| 282|
|Average| 90.6| 288.8| 88| 304|
|Median| 90| 321| **90**| **321**|
|StdDev| 9.98| 119.| - | - |
After removing the two highest and two lowest runs for each case and taking
the median:
Async: 90 sec
Sync: 321 sec
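The trimming described above can be reproduced from the per-run times in the table with a short Python sketch (the run times are copied from the table; the helper name is mine):

```python
import statistics

def trimmed_stats(times, trim=2):
    """Drop the `trim` lowest and `trim` highest runs, then report (mean, median)."""
    kept = sorted(times)[trim:-trim]
    return statistics.mean(kept), statistics.median(kept)

async_times = [84, 90, 78, 84, 108, 90, 84, 108, 90, 90]
sync_times = [102, 366, 156, 128, 402, 432, 378, 324, 318, 282]

print(trimmed_stats(async_times))  # mean 88, median 90
print(trimmed_stats(sync_times))   # mean 304, median 321
```

This matches the "w/o outliers" averages (88 and 304) and medians (90 and 321) reported in the table.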