xiarixiaoyao commented on a change in pull request #5165:
URL: https://github.com/apache/hudi/pull/5165#discussion_r840644135
##########
File path:
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadIncrementalRelation.scala
##########
@@ -95,6 +108,9 @@ class MergeOnReadIncrementalRelation(sqlContext: SQLContext,
hadoopConf = new Configuration(conf)
)
+ // setUp tableRequiredSchema
+ tableRequiredSchema = requiredSchema.structTypeSchema
Review comment:
> @xiarixiaoyao while I understand your intent that we can speed things
up for some of the queries, I don't think enabling the vectorized reader for
Incremental queries will bring a universal speed-up for all queries --
the vectorized reader means we first fetch records and only do filtering
in memory, which could clearly be a disadvantage for large tables.
>
> So my suggestion would be to NOT enable the vectorized reader for Incremental
queries by default, but instead let individual users decide whether they
want to enable it or not.
@alexeykudinkin I don't think record-level filtering gives better performance;
Spark leaves it off by default.
Vectorization reads batch by batch, 4096 rows per batch, and then filters
directly in memory.
Reading the data out vectorized and then filtering it performs better. I will
run a test against 100 GB of data and post the results.
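To illustrate the tradeoff being debated, here is a toy sketch in plain Scala (no Spark, all names and the 4096 batch size are illustrative only): the "vectorized" path materializes fixed-size batches first and filters them in memory afterwards, while the "record-level" path filters each record as it is read. Both produce the same rows; they differ only in how much data is materialized before the filter runs.

```scala
// Toy model of the two read/filter strategies discussed above.
// 4096 mirrors the batch size mentioned in the comment; the predicate
// and data set are arbitrary placeholders.
val data = (1 to 10000).toVector
val predicate = (n: Int) => n % 2 == 0

// Vectorized style: read whole batches, then filter each batch in memory.
val batchFiltered =
  data.grouped(4096)              // materialize fixed-size batches
    .flatMap(batch => batch.filter(predicate))
    .toVector

// Record-level style: apply the filter to each record as it is read.
val recordFiltered = data.filter(predicate)

// Both strategies yield identical results.
assert(batchFiltered == recordFiltered)
```

The point of contention is only the cost profile: for highly selective filters on large tables, the batch path pulls rows into memory that the record-level path would never materialize.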
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]