[GitHub] [hudi] xiarixiaoyao commented on pull request #5168: [HUDI-3729][SPARK] fixed the per regression by enable vectorizeReader for parquet file

GitBox Tue, 29 Mar 2022 18:54:28 -0700


xiarixiaoyao commented on pull request #5168:
URL: https://github.com/apache/hudi/pull/5168#issuecomment-1082534853



   @alexeykudinkin I have addressed the comment.
   But I still have a question: do we really need to force spark.sql.parquet.recordLevelFilter.enabled=true for MOR/COW snapshot queries?
   Test preparation: set spark.sql.parquet.enableVectorizedReader=false, since spark.sql.parquet.recordLevelFilter.enabled conflicts with it.
   Here are the benchmark code and results:
   ```
           prepareHoodieCowTable(tableName, new Path(f.getCanonicalPath, tableName).toUri.toString)
           val benchmark = new HoodieBenchmark("perf cow snapshot read", 1000000)
           benchmark.addCase("recordLevelFilter enable") { _ =>
             // spark.sessionState.conf.setConfString("spark.sql.parquet.enableVectorizedReader", "false")
             spark.sessionState.conf.setConfString("spark.sql.parquet.filterPushdown", "true")
             spark.sessionState.conf.setConfString("spark.sql.parquet.recordLevelFilter.enabled", "true")
             spark.sql(s"select c1, c3, c4, c5 from $tableName").where("c1 > 100000").count()
           }
           benchmark.addCase("recordLevelFilter disable") { _ =>
             spark.sessionState.conf.setConfString("spark.sql.parquet.filterPushdown", "true")
             spark.sessionState.conf.setConfString("spark.sql.parquet.recordLevelFilter.enabled", "false")
             spark.sql(s"select c1, c3, c4, c5 from $tableName").where("c1 > 100000").count()
           }

   perf cow snapshot read:       Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------
   recordLevelFilter enable                693            751          69         1.4         693.1       1.0X
   recordLevelFilter disable               662            680          27         1.5         662.4       1.0X

   ```
   I don't see any performance improvement from setting spark.sql.parquet.recordLevelFilter.enabled=true.
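
   To make the comparison concrete, here is a minimal sketch that recomputes the per-row time and compares the two cases against the reported spread. The figures are copied from the benchmark table; the object name, case class, and helper functions are hypothetical and are not part of Hudi or Spark.

   ```scala
   // Illustrative only: the numbers come from the benchmark table above,
   // the names (RecordLevelFilterComparison, Case, helpers) are made up.
   object RecordLevelFilterComparison {
     final case class Case(name: String, bestMs: Double, avgMs: Double, stdevMs: Double)

     // Row count passed to HoodieBenchmark in the benchmark code.
     val rows: Long = 1000000L
     val enabled: Case = Case("recordLevelFilter enable", 693.0, 751.0, 69.0)
     val disabled: Case = Case("recordLevelFilter disable", 662.0, 680.0, 27.0)

     // Nanoseconds per row, derived from the best time in milliseconds.
     def perRowNs(c: Case): Double = c.bestMs * 1e6 / rows

     // Ratio of best times; a value close to 1.0 means the cases are on par.
     def bestTimeRatio(a: Case, b: Case): Double = a.bestMs / b.bestMs

     // Gap between the average times, in units of the larger standard deviation.
     def avgGapInStdevs(a: Case, b: Case): Double =
       math.abs(a.avgMs - b.avgMs) / math.max(a.stdevMs, b.stdevMs)

     def main(args: Array[String]): Unit = {
       println(f"per row (enable):  ${perRowNs(enabled)}%.1f ns")
       println(f"per row (disable): ${perRowNs(disabled)}%.1f ns")
       println(f"best-time ratio:   ${bestTimeRatio(enabled, disabled)}%.3f")
       println(f"avg gap in stdevs: ${avgGapInStdevs(enabled, disabled)}%.2f")
     }
   }
   ```

   With these figures the best-time ratio is about 1.05 and the average times differ by roughly one standard deviation of the noisier case, so the run shows no gain from enabling the record-level filter.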
   


