alexeykudinkin commented on code in PR #5165:
URL: https://github.com/apache/hudi/pull/5165#discussion_r845671911


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadIncrementalRelation.scala:
##########
@@ -95,6 +108,9 @@ class MergeOnReadIncrementalRelation(sqlContext: SQLContext,
       hadoopConf = new Configuration(conf)
     )
 
+    // setUp tableRequiredSchema
+    tableRequiredSchema = requiredSchema.structTypeSchema

Review Comment:
   @xiarixiaoyao sorry for the late reply.
   
   I understand that it reads in batches, but my point still stands: unfortunately (I don't fully understand why these two things are incompatible, but they are), when we use vectorization, Parquet filter push-down stops working. That means every row-group now has to be read into memory before being filtered out as irrelevant.
   
   For use-cases with localized storage (i.e., adjacent to the compute) this might not be a problem, but it's more likely to be one for use-cases with disaggregated compute and storage (for example, in the cloud), where we'd have to read the whole row-group just for Spark to be able to filter it out in memory.
   
   In addition to the benchmark you ran locally, could we also test a similar setup in the cloud (AWS, or wherever), so that we get a better sense of what the impact would be in a cloud environment?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
