pengzhiwei2018 commented on pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#issuecomment-808734748
> CatalogFileIndex
Hi @umehrot2
I have tested querying the Hive table, and `CatalogFileIndex` does perform
partition pruning. For the data source table, however, partition pruning does
not work. The main reason is that we do not pass the partition schema to
`HadoopFsRelation` in the following code in `DefaultSource#getBaseFileOnlyView`:
> // simply return as a regular parquet relation
> DataSource.apply(
>   sparkSession = sqlContext.sparkSession,
>   paths = extraReadPaths,
>   userSpecifiedSchema = Option(schema),
>   className = "parquet",
>   options = optParams)
>   .resolveRelation()
As a result, Spark treats it as a non-partitioned table, so partition pruning
does not work.
In our PR, we do a lot of work to infer the partition schema, which enables
partition pruning whether `URL_ENCODE_PARTITIONING_OPT_KEY` is true or false.
It also supports partition pruning for non-hive-style partitions.
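
To make the idea concrete, here is a minimal sketch in plain Scala of how partition values could be recovered from a relative partition path and then used for pruning. It is not the code in this PR: `PartitionPathSketch`, `parse`, and `prune` are hypothetical names, the `urlEncoded` flag only mirrors the intent of `URL_ENCODE_PARTITIONING_OPT_KEY`, and every value is kept as a string for simplicity.

```scala
import java.net.URLDecoder

// Hypothetical helper for illustration only; not Hudi's actual implementation.
object PartitionPathSketch {

  /** Recover partition column values from a relative partition path.
    * Handles hive-style ("dt=2021-03-28/hh=10") and plain ("2021-03-28/10") layouts.
    * `urlEncoded` is an assumed flag mirroring URL_ENCODE_PARTITIONING_OPT_KEY. */
  def parse(path: String, columns: Seq[String], urlEncoded: Boolean): Map[String, String] = {
    val fragments = path.split("/").filter(_.nonEmpty).toSeq
    require(fragments.length == columns.length,
      s"expected ${columns.length} fragments in '$path'")
    columns.zip(fragments).map { case (col, fragment) =>
      // Hive-style fragments carry a "col=" prefix; plain fragments are the value itself.
      val raw =
        if (fragment.startsWith(col + "=")) fragment.substring(col.length + 1) else fragment
      val value = if (urlEncoded) URLDecoder.decode(raw, "UTF-8") else raw
      col -> value
    }.toMap
  }

  /** Keep only the partition paths whose parsed values satisfy the predicate. */
  def prune(paths: Seq[String], columns: Seq[String], urlEncoded: Boolean)
           (predicate: Map[String, String] => Boolean): Seq[String] =
    paths.filter(p => predicate(parse(p, columns, urlEncoded)))
}
```

With something like this, a filter on `dt` only has to look at the partition paths, and the files under non-matching partitions are never listed or read.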
By merging this PR, we gain the following:
- Support for partition pruning on Hudi data source tables.
- Support for non-star queries on Hudi.
  This is very important for the Spark SQL integration for Hudi. The table path
  in the Spark properties should not contain stars.
  [HUDI-1415](https://github.com/apache/hudi/pull/2283) is still blocked on
  this feature.
- We can do more optimization in `HoodieFileIndex`. I have a few
  optimizations in mind.
  1. Support time travel queries for Hudi, just like Delta does, e.g.
     `select * from h0 where _hoodie_commit_time = '20210328'`. We can get the
     filter condition on `_hoodie_commit_time` in `HoodieFileIndex` and query
     the specified version of the data.
  2. Support rowKey skipping for queries, e.g.
     `select * from h0 where rowKey = '10'`. We can use the filter condition
     `rowKey = '10'` to skip data via the Bloom filter (Hudi currently stores
     the rowKey Bloom filter in the parquet file), which can improve query
     performance.
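
As a rough illustration of those two optimizations, the sketch below shows how a file index could drop files by commit time and by a per-file Bloom filter before any data is read. Everything here is an assumption for illustration: `FileSkippingSketch`, `TinyBloom`, `BaseFile`, `asOf`, and `mayContainKey` are hypothetical names, the "as of" semantics for the commit-time filter is my simplification, and `TinyBloom` is a toy stand-in for the Bloom filter Hudi stores in the parquet file.

```scala
import java.util.BitSet
import scala.util.hashing.MurmurHash3

// Hypothetical sketch for illustration only; not the HoodieFileIndex implementation.
object FileSkippingSketch {

  /** Toy Bloom filter standing in for the one stored with each parquet file. */
  final class TinyBloom(numBits: Int, numHashes: Int) {
    private val bits = new BitSet(numBits)

    // One bit position per hash seed, mapped into [0, numBits).
    private def positions(key: String): Seq[Int] =
      (0 until numHashes).map { seed =>
        java.lang.Math.floorMod(MurmurHash3.stringHash(key, seed), numBits)
      }

    def add(key: String): Unit = positions(key).foreach(i => bits.set(i))

    // false means "definitely absent"; true means "possibly present".
    def mightContain(key: String): Boolean = positions(key).forall(i => bits.get(i))
  }

  /** Assumed file descriptor: path, commit time, and the file's row-key filter. */
  final case class BaseFile(path: String, commitTime: String, rowKeys: TinyBloom)

  /** Files written at or before `commitTime`; timestamps like "20210328"
    * compare correctly as plain strings. */
  def asOf(files: Seq[BaseFile], commitTime: String): Seq[BaseFile] =
    files.filter(_.commitTime <= commitTime)

  /** Files that may contain `rowKey`; all others can be skipped without reading them. */
  def mayContainKey(files: Seq[BaseFile], rowKey: String): Seq[BaseFile] =
    files.filter(_.rowKeys.mightContain(rowKey))
}
```

A Bloom filter can return false positives but never false negatives, so skipping a file on a negative answer is safe, and the remaining files still need the normal `rowKey = '10'` filter applied.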