[GitHub] [hudi] pengzhiwei2018 commented on pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…

GitBox Sat, 27 Mar 2021 06:40:45 -0700


pengzhiwei2018 commented on pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#issuecomment-808734748



   > CatalogFileIndex
   
   Hi @umehrot2
   I have tested querying the Hive table, and the `CatalogFileIndex` does 
perform partition pruning. For the data source table, however, partition 
pruning does not work. The main reason is that we do not pass the partition 
schema to the `HadoopFsRelation` in the following code in 
`DefaultSource#getBaseFileOnlyView`:
   
   > // simply return as a regular parquet relation
   >       DataSource.apply(
   >         sparkSession = sqlContext.sparkSession,
   >         paths = extraReadPaths,
   >         userSpecifiedSchema = Option(schema),
   >         className = "parquet",
   >         options = optParams)
   >         .resolveRelation()
   As a result, Spark treats it as a non-partitioned table, so partition 
pruning does not work.
   
   In our PR, we do a lot of work to infer the partition schema, which 
supports partition pruning whether URL_ENCODE_PARTITIONING_OPT_KEY is true or 
false. It also supports partition pruning for non-hive-style partitions.
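
To illustrate the idea of inferring partition values from paths, here is a minimal sketch in plain Scala. It is not the code from this PR: `PartitionPathSketch`, `parsePartitionValues` and `prune` are hypothetical names, and the rule that a single partition column takes the whole relative path as its value (so that a non-hive-style path such as `2021/03/27` still maps to one column) is an assumption made for the example.

```scala
import java.net.URLDecoder
import java.nio.charset.StandardCharsets

// Hypothetical helper for illustration only; not Hudi's HoodieFileIndex.
object PartitionPathSketch {

  // Recover partition column values from a partition path relative to the table base path.
  def parsePartitionValues(
      partitionColumns: Seq[String],
      relativePath: String,
      urlEncoded: Boolean): Map[String, String] = {

    // Drop a hive-style "column=" prefix if present, then URL-decode if requested.
    def clean(column: String, fragment: String): String = {
      val raw = fragment.stripPrefix(column + "=")
      if (urlEncoded) URLDecoder.decode(raw, StandardCharsets.UTF_8.name()) else raw
    }

    val trimmed = relativePath.stripPrefix("/").stripSuffix("/")
    partitionColumns match {
      // Assumption: with one partition column, the whole path is its value,
      // which covers non-encoded values that contain slashes.
      case Seq(single) => Map(single -> clean(single, trimmed))
      case columns =>
        val fragments = trimmed.split("/").toSeq
        require(fragments.length == columns.length,
          s"Cannot map path '$relativePath' to columns ${columns.mkString(",")}")
        columns.zip(fragments).map { case (c, f) => c -> clean(c, f) }.toMap
    }
  }

  // Keep only the partition paths whose parsed values satisfy the predicate.
  def prune(
      partitionColumns: Seq[String],
      partitionPaths: Seq[String],
      urlEncoded: Boolean)(
      predicate: Map[String, String] => Boolean): Seq[String] =
    partitionPaths.filter { path =>
      predicate(parsePartitionValues(partitionColumns, path, urlEncoded))
    }
}
```

With values recovered this way, a filter such as `dt = '2021-03-27'` can be evaluated against the partition paths before any data file is listed or read.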
   
   By merging this PR, we gain the following:
   - Support for partition pruning on Hudi data source tables.
   - Support for non-star queries on Hudi.
     This is very important for the Spark SQL integration of Hudi. The table 
path in the Spark properties should not contain stars.
     [HUDI-1415](https://github.com/apache/hudi/pull/2283) is still blocked on 
this feature.
   - We can do more optimization in the `HoodieFileIndex`. I have a few 
optimizations in mind.
      1. Support time travel queries for Hudi, just like Delta does, e.g. 
`select * from h0 where _hoodie_commit_time = '20210328'`. We can get the 
filter condition on `_hoodie_commit_time` in the `HoodieFileIndex` and query 
the specified version of the data.
      2. Support row key skipping for queries, e.g. 
`select * from h0 where rowKey = '10'`. We can use the filter condition 
`rowKey = '10'` to skip data via the bloom filter (Hudi currently stores the 
bloom filter of the row key in the parquet file), which can improve query 
performance.
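
The row key skipping idea can be sketched roughly as follows. This is a toy model in plain Scala, not Hudi's bloom filter implementation and not part of this PR: `ToyBloomFilter`, `BaseFile` and `candidateFiles` are hypothetical names, and the filter sizes used are arbitrary.

```scala
import java.util.BitSet
import scala.util.hashing.MurmurHash3

// Hypothetical sketch for illustration only; Hudi's real bloom filter lives
// in the parquet footer and has its own format.
object RowKeySkipSketch {

  // A tiny bloom filter: numHashes seeded hashes over a fixed-size bit set.
  final class ToyBloomFilter(numBits: Int, numHashes: Int) {
    private val bits = new BitSet(numBits)

    private def positions(key: String): Seq[Int] =
      (0 until numHashes).map { seed =>
        Math.floorMod(MurmurHash3.stringHash(key, seed), numBits)
      }

    def add(key: String): Unit = positions(key).foreach(i => bits.set(i))

    // May return true for a key that was never added (false positive),
    // but never returns false for a key that was added.
    def mightContain(key: String): Boolean = positions(key).forall(i => bits.get(i))
  }

  final case class BaseFile(path: String, filter: ToyBloomFilter)

  // Given an equality filter on the row key, keep only the files whose
  // bloom filter says the key might be present.
  def candidateFiles(files: Seq[BaseFile], rowKey: String): Seq[BaseFile] =
    files.filter(_.filter.mightContain(rowKey))
}
```

Because a bloom filter has no false negatives, skipping a file whose filter rejects the key is safe; a false positive only costs reading a file that turns out not to contain the key.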


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
