[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

via GitHub Fri, 24 Feb 2023 09:38:52 -0800


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1444106912


   > Hi, I am very interested in this optimization and have some questions 
from testing it on a cluster with 4 nodes/96 cores using Spark 3.1. Unfortunately, I 
see little improvement.
   
   You're likely to see improvement in cases where file I/O is the bottleneck. 
Most TPC-DS queries are join-heavy, so you will see little improvement there. 
You might do better with TPC-H. 
   
   > I am confused, then, about whether it is necessary to keep 
spark.sql.parquet.enableVectorizedReader = false in Spark when testing with 
Spark 3.1, and how I can set the parquet buffer size. 
   
   It's probably best to keep the parquet (read) buffer size untouched.
   
   You should keep `spark.sql.parquet.enableVectorizedReader = true` 
irrespective of this. This feature improves the I/O speed of reading raw data 
from storage. The Spark vectorized reader kicks in after the data has been 
read: it converts the raw data into Spark's internal columnar representation, 
and it is faster than the row-based reader.  
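   
   For reference, a minimal sketch of that setting as a `spark-defaults.conf` entry. The file location and how your cluster picks it up are assumptions about your deployment; the same key can also be passed to `spark-submit` as `--conf key=value`. Spark already defaults this option to `true`, so simply not overriding it should have the same effect.
   
   ```properties
   # spark-defaults.conf
   # Keep the vectorized Parquet reader on; it runs after the raw bytes
   # have been read from storage, so it is independent of the read path.
   spark.sql.parquet.enableVectorizedReader  true
   ```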
   


