alexeykudinkin commented on code in PR #5165:
URL: https://github.com/apache/hudi/pull/5165#discussion_r845671911
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadIncrementalRelation.scala:
##########
@@ -95,6 +108,9 @@ class MergeOnReadIncrementalRelation(sqlContext: SQLContext,
hadoopConf = new Configuration(conf)
)
+ // setUp tableRequiredSchema
+ tableRequiredSchema = requiredSchema.structTypeSchema
Review Comment:
@xiarixiaoyao sorry for the late reply.
I understand that it reads in batches, but my point still stands --
unfortunately, when we use vectorization (I don't fully understand why
these two things are incompatible, but nevertheless), Parquet filter
push-down stops working, which means that every row group has to be read
into memory before being filtered out as irrelevant.
For use cases with localized storage (i.e., adjacent to the compute) this
might not be a problem, but it is more likely to be one for use cases with
disaggregated compute and storage (for example, in the cloud) -- we would
now have to read the whole row group just so Spark can filter it out in
memory.
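For context, the two Spark SQL settings involved in this trade-off are sketched below (a config fragment only, with Spark's default values shown; whether push-down actually takes effect on the Hudi MOR merged-reading path depends on the code in this PR):

```scala
// Spark SQL configs relevant to the trade-off described above
// (values shown are the Spark defaults):
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true") // batch/vectorized Parquet decoding
spark.conf.set("spark.sql.parquet.filterPushdown", "true")         // push predicates down to the Parquet reader
```

If the merged-log reading path effectively disables the pushed-down filters while vectorization is on, the cost surfaces as full row-group reads, which is exactly the disaggregated-storage concern above.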
In addition to the benchmark you ran locally, could we also test a similar
setup in the cloud (AWS, or wherever), so that we get a better sense of
what the impact would be in a cloud environment?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]