xiarixiaoyao commented on a change in pull request #5168:
URL: https://github.com/apache/hudi/pull/5168#discussion_r837295103
##########
File path:
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadIncrementalRelation.scala
##########
@@ -48,6 +48,13 @@ class MergeOnReadIncrementalRelation(sqlContext: SQLContext,
override type FileSplit = HoodieMergeOnReadFileSplit
+  override def imbueConfigs(sqlContext: SQLContext): Unit = {
+    sqlContext.sparkSession.sessionState.conf.setConfString("spark.sql.parquet.filterPushdown", "true")
+    sqlContext.sparkSession.sessionState.conf.setConfString("spark.sql.parquet.recordLevelFilter.enabled", "true")
Review comment:
MOR incremental queries need to filter data through the file-level (record-level) filters on the Spark side:
```
val PARQUET_RECORD_FILTER_ENABLED = buildConf("spark.sql.parquet.recordLevelFilter.enabled")
  .doc("If true, enables Parquet's native record-level filtering using the pushed down " +
    "filters. " +
    s"This configuration only has an effect when '${PARQUET_FILTER_PUSHDOWN_ENABLED.key}' " +
    "is enabled and the vectorized reader is not used. You can ensure the vectorized reader " +
    s"is not used by setting '${PARQUET_VECTORIZED_READER_ENABLED.key}' to false.")
  .version("2.3.0")
  .booleanConf
  .createWithDefault(false)
```
We must ensure that both of these configuration items take effect; otherwise the query will return duplicate records.
I have also posted another PR, https://github.com/apache/hudi/pull/5165, to enable vectorized reads for all COW/MOR reads.
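For reference, a minimal sketch of forcing the settings this incremental relation depends on. This assumes a live `SparkSession` named `spark`; the config keys come from Spark's `SQLConf` (the record-level filter key is quoted above), and per the quoted doc, `recordLevelFilter.enabled` only takes effect when the vectorized reader is not used:

```scala
// Hedged sketch, not the Hudi implementation itself: make sure Parquet
// filter pushdown and record-level filtering are both on, and the
// vectorized reader is off (record-level filtering is ignored when the
// vectorized path is used, per the SQLConf doc quoted above).
val conf = spark.sessionState.conf
conf.setConfString("spark.sql.parquet.filterPushdown", "true")
conf.setConfString("spark.sql.parquet.recordLevelFilter.enabled", "true")
conf.setConfString("spark.sql.parquet.enableVectorizedReader", "false")
```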
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]