[
https://issues.apache.org/jira/browse/HUDI-920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173594#comment-17173594
]
Yanjia Gary Li commented on HUDI-920:
-------------------------------------
The most challenging part of the incremental query for MOR is how to handle
the filter based on commit time.
The current implementation simply uses the Spark DataFrame API:
[https://github.com/apache/hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala#L122]
But for an RDD, we can no longer use this approach.
Based on my investigation, these are the approaches I can come up with:
* Use PrunedFilteredScan and push the extra filter down. For this approach,
we will have to:
** Enable record-level Parquet filtering and disable the vectorized reader:
sqlContext.sparkSession.sessionState.conf.setConfString("spark.sql.parquet.recordLevelFilter.enabled", "true")
sqlContext.sparkSession.sessionState.conf.setConfString("spark.sql.parquet.enableVectorizedReader", "false")
** Filter Avro records ourselves at the record-reading level.
** Trick Spark into scanning the whole file when the user's query would not
otherwise trigger a file scan, like *df.count()*. With the default
ParquetReaderFunction, it will just read the metadata for the count and won't
apply the filter.
* Apply the filter before buildScan() returns.
** We will have to ser/deser the UnsafeRow to get the _hoodie_commit_time.
* Look into Spark planning to see if it's possible to force the filter somewhere.
* Explore Data Source V2 to see if there is better support.
All we need is to force a filter before returning the result to the user, but I
don't see that this is supported in Data Source V1. At this point, the first
approach seems the most reasonable one to me, and I can make a PR soon.
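To make the "filter records ourselves" part of the first approach concrete,
here is a rough sketch in plain Scala, with no Spark or Hudi dependency. The
names HoodieLikeRecord, CommitTimeFilterSketch and inRange are made up for
illustration and are not Hudi classes. It also assumes that commit times are
fixed-width timestamp strings, so plain string comparison orders them, and
that both bounds are inclusive; the real bounds would come from the commits
selected for the incremental query.

```scala
// Hypothetical stand-in for a record read from a base or log file.
// Only the commit-time metadata field matters for this sketch.
final case class HoodieLikeRecord(commitTime: String, key: String)

object CommitTimeFilterSketch {
  // Keep a record only if its commit time falls within [first, last].
  // Assumption: fixed-width timestamp strings compare correctly as strings.
  def inRange(first: String, last: String)(r: HoodieLikeRecord): Boolean =
    r.commitTime.compareTo(first) >= 0 && r.commitTime.compareTo(last) <= 0

  // Made-up sample data: four records written by four different commits.
  val records: Seq[HoodieLikeRecord] = Seq(
    HoodieLikeRecord("20200801100000", "a"),
    HoodieLikeRecord("20200805100000", "b"),
    HoodieLikeRecord("20200808100000", "c"),
    HoodieLikeRecord("20200810100000", "d")
  )

  // The filter is applied per record while reading, before anything is
  // handed back to the caller. Here it keeps records "b" and "c".
  val incremental: Seq[HoodieLikeRecord] =
    records.filter(inRange("20200805100000", "20200808100000"))
}
```

In the actual reader the same predicate would run on each Avro record as it
is read, so it would not depend on Spark deciding to apply a pushed-down
filter.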
[~vinoth] [~bhasudha] [~uditme] WDYT?
> Incremental view on MOR table using Spark Datasource
> ----------------------------------------------------
>
> Key: HUDI-920
> URL: https://issues.apache.org/jira/browse/HUDI-920
> Project: Apache Hudi
> Issue Type: New Feature
> Components: Spark Integration
> Reporter: Yanjia Gary Li
> Assignee: Yanjia Gary Li
> Priority: Blocker
> Fix For: 0.6.0
>
>