[ https://issues.apache.org/jira/browse/HUDI-920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173594#comment-17173594 ]

Yanjia Gary Li commented on HUDI-920:
-------------------------------------

The most challenging part of the incremental query for MOR is how to handle 
the filter based on commit time.

The current implementation simply uses the Spark DataFrame API: 
[https://github.com/apache/hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala#L122]

But for the RDD-based path, we can't use this approach anymore.
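For context, the DataFrame-side filter linked above is semantically just a range check on the _hoodie_commit_time_ column. A minimal sketch of that predicate in plain Scala (the object and method names here are illustrative, not the actual Hudi API; I'm assuming the begin timestamp is exclusive and the end inclusive, and that Hudi commit times are timestamp strings that order lexicographically):

```scala
// Sketch of the incremental-range predicate applied in IncrementalRelation.
// Hudi commit times are fixed-width timestamp strings, so plain string
// comparison gives the intended range semantics.
object CommitTimeFilter {
  // begin is assumed exclusive, end inclusive
  def commitTimeInRange(commitTime: String, begin: String, end: String): Boolean =
    commitTime > begin && commitTime <= end
}
```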

Based on my investigation, these are the approaches I can come up with:
 * Use PrunedFilteredScan and push the extra filter down. For this approach, we will have to:
 ** Disable the vectorized reader and enable record-level filtering:
sqlContext.sparkSession.sessionState.conf.setConfString("spark.sql.parquet.recordLevelFilter.enabled", "true")
sqlContext.sparkSession.sessionState.conf.setConfString("spark.sql.parquet.enableVectorizedReader", "false")
 ** Filter Avro records ourselves at the record-reading level.
 ** Trick Spark into scanning the whole file when the user's query won't trigger a 
file scan by itself, e.g. *df.count()*. With the default ParquetReaderFunction, Spark 
will just read the metadata for the count and won't apply the filter.
 * Apply the filter before buildScan() returns.
 ** We will have to ser/deser the UnsafeRow to get the _hoodie_commit_time_.
 * Dig into Spark query planning to see if it's possible to force the filter somewhere.
 * Explore Data Source V2 to see if there is better support.
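To illustrate the record-level step of the first approach, here is a hedged sketch of filtering fully read records ourselves, since the vectorized reader and the metadata-only count path would otherwise bypass the predicate. The names here (RecordLevelFilter, records modeled as field maps) are illustrative only; the real reader would hand back Avro GenericRecords:

```scala
// Sketch: filter materialized records by _hoodie_commit_time at the
// record-reading level. Records are modeled as Map[String, String] here
// purely for illustration; a real implementation would read the field
// from an Avro GenericRecord.
object RecordLevelFilter {
  val CommitTimeField = "_hoodie_commit_time"

  // beginExclusive / endInclusive mirror the assumed incremental query contract
  def filterByCommitRange(
      records: Seq[Map[String, String]],
      beginExclusive: String,
      endInclusive: String): Seq[Map[String, String]] =
    records.filter { r =>
      r.get(CommitTimeField).exists(ct => ct > beginExclusive && ct <= endInclusive)
    }
}
```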

All we need is to force a filter before returning the result to the user, but I 
don't see this supported in Data Source V1. At this point, the first approach 
is the most reasonable one to me, and I can make a PR soon.

[~vinoth] [~bhasudha] [~uditme] WDYT?

> Incremental view on MOR table using Spark Datasource
> ----------------------------------------------------
>
>                 Key: HUDI-920
>                 URL: https://issues.apache.org/jira/browse/HUDI-920
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: Spark Integration
>            Reporter: Yanjia Gary Li
>            Assignee: Yanjia Gary Li
>            Priority: Blocker
>             Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)
