xiarixiaoyao commented on a change in pull request #5165:
URL: https://github.com/apache/hudi/pull/5165#discussion_r840644135
##########
File path:
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadIncrementalRelation.scala
##########
@@ -95,6 +108,9 @@ class MergeOnReadIncrementalRelation(sqlContext: SQLContext,
hadoopConf = new Configuration(conf)
)
+ // setUp tableRequiredSchema
+ tableRequiredSchema = requiredSchema.structTypeSchema
Review comment:
> @xiarixiaoyao while I understand your intent that we can speed things
up for some of the queries, I don't think enabling the vectorized reader for
Incremental queries will bring a universal speed-up for all queries --
the vectorized reader means we first fetch records and only do filtering
in memory, which could clearly be a disadvantage for large tables.
>
> So my suggestion would be to NOT enable the vectorized reader for Incremental
queries by default, but instead let individual users decide whether they
want to enable it or not.
@alexeykudinkin I don't think record-level filtering gives better performance;
Spark leaves it off by default.
Vectorization reads batch by batch, 4096 rows per batch, and then filters
directly in memory.
Reading the data out vectorized and then filtering it performs better. I will
run a test against 100 GB of data and post the results.
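To illustrate the tradeoff being debated, here is a toy sketch in plain Scala (no Spark, all names and the 4096 batch size are illustrative only): the "vectorized" path materializes fixed-size batches first and filters them in memory afterwards, while the "record-level" path filters each record as it is read. Both produce the same rows; they differ only in how much data is materialized before the filter runs.

```scala
// Toy model of the two read/filter strategies discussed above.
// 4096 mirrors the batch size mentioned in the comment; the predicate
// and data set are arbitrary placeholders.
val data = (1 to 10000).toVector
val predicate = (n: Int) => n % 2 == 0

// Vectorized style: read whole batches, then filter each batch in memory.
val batchFiltered =
  data.grouped(4096)              // materialize fixed-size batches
    .flatMap(batch => batch.filter(predicate))
    .toVector

// Record-level style: apply the filter to each record as it is read.
val recordFiltered = data.filter(predicate)

// Both strategies yield identical results.
assert(batchFiltered == recordFiltered)
```

The point of contention is only the cost profile: for highly selective filters on large tables, the batch path pulls rows into memory that the record-level path would never materialize.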
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]