[ https://issues.apache.org/jira/browse/HUDI-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vova Kolmakov reassigned HUDI-7639:
-----------------------------------

    Assignee:     (was: Vova Kolmakov)

> Refactor HoodieFileIndex so that different indexes can be used via optimizer rules
> ----------------------------------------------------------------------------------
>
>                 Key: HUDI-7639
>                 URL: https://issues.apache.org/jira/browse/HUDI-7639
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Sagar Sumit
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
>
> Currently, `HoodieFileIndex` is responsible for both partition pruning and
> file skipping. All indexes are used in the
> [lookupCandidateFilesInMetadataTable|https://github.com/apache/hudi/blob/b5b14f7d4fa6224a6674b021664b510c6ae8afb9/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala#L333]
> method through if-else branches. This is not only hard to maintain as we add
> more indexes, but it also induces a static hierarchy. Instead, we need more
> flexibility so that we can alter the logical plan based on the availability of
> indexes. For partition pruning in Spark, we already have the
> [HoodiePruneFileSourcePartitions|https://github.com/apache/hudi/blob/b5b14f7d4fa6224a6674b021664b510c6ae8afb9/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodiePruneFileSourcePartitions.scala#L40]
> rule, but it is injected during the operator optimization batch and does not
> modify the resulting LogicalPlan. To be fully extensible, we should be able to
> rewrite the LogicalPlan. We should be able to inject rules after partition
> pruning, i.e., after the operator optimization batch and before any CBO rules
> that depend on stats. Spark provides the
> [injectPreCBORules|https://github.com/apache/spark/blob/6232085227ee2cc4e831996a1ac84c27868a1595/sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala#L304]
> API to do so; however, it is only available from Spark 3.1.0 onwards.
> The goal of this ticket is to refactor the index hierarchy and create new
> rules such that Spark versions < 3.1.0 still go via the old path, while later
> versions can modify the plan using an appropriate index, injected as a
> pre-CBO rule.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
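The version-gated dispatch the ticket describes could be sketched as below. This is a minimal illustration only: `SparkVersionGate` and `supportsPreCBORules` are hypothetical names, not part of the Hudi codebase, and the sketch assumes version strings of the form "major.minor[.patch]".

```scala
// Hypothetical helper (sketch only): decide whether the pre-CBO rule
// injection path is available for a given Spark version string.
object SparkVersionGate {
  // Spark's pre-CBO injection API exists only from 3.1.0 onwards, so
  // older versions must fall back to the existing HoodieFileIndex path.
  def supportsPreCBORules(sparkVersion: String): Boolean = {
    // Take the "major.minor" prefix; patch level does not matter here.
    val parts = sparkVersion.split("\\.").take(2).map(_.toInt)
    val (major, minor) = (parts(0), parts(1))
    major > 3 || (major == 3 && minor >= 1)
  }
}
```

In a session-extension class, such a gate would choose between registering a plan-rewriting pre-CBO rule (Spark >= 3.1.0) and leaving the old `HoodieFileIndex` lookup in place.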