Sagar Sumit created HUDI-7639:
---------------------------------
Summary: Refactor HoodieFileIndex so that different indexes can be
used via optimizer rules
Key: HUDI-7639
URL: https://issues.apache.org/jira/browse/HUDI-7639
Project: Apache Hudi
Issue Type: Task
Reporter: Sagar Sumit
Fix For: 1.0.0
Currently, `HoodieFileIndex` is responsible for both partition pruning and
file skipping. All indexes are applied inside the
[lookupCandidateFilesInMetadataTable|https://github.com/apache/hudi/blob/b5b14f7d4fa6224a6674b021664b510c6ae8afb9/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala#L333]
method through if-else branches. This is not only hard to maintain as we add
more indexes, but it also imposes a static hierarchy. Instead, we need more
flexibility so that we can alter the logical plan based on the availability of indexes.
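A rough sketch of what a pluggable index abstraction could look like (the trait
and method names below are placeholders, not the final API):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.Expression

// Placeholder trait: each index advertises whether it can serve the query
// filters and, if so, computes the pruned candidate file set.
trait IndexSupport {
  def indexName: String
  def isIndexAvailable(queryFilters: Seq[Expression]): Boolean
  def computeCandidateFileNames(queryFilters: Seq[Expression],
                                allFiles: Seq[String]): Option[Set[String]]
}

// HoodieFileIndex (or an optimizer rule) would then iterate over the
// registered indexes instead of hard-coded if-else branches.
class CandidateFileLookup(indexes: Seq[IndexSupport]) {
  def lookup(queryFilters: Seq[Expression],
             allFiles: Seq[String]): Option[Set[String]] =
    indexes.find(_.isIndexAvailable(queryFilters))
      .flatMap(_.computeCandidateFileNames(queryFilters, allFiles))
}
{code}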
For partition pruning in Spark, we already have the
[HoodiePruneFileSourcePartitions|https://github.com/apache/hudi/blob/b5b14f7d4fa6224a6674b021664b510c6ae8afb9/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodiePruneFileSourcePartitions.scala#L40]
rule, but it is injected during the operator optimization batch and it does not
modify the resulting LogicalPlan. To be fully extensible, we should be able
to rewrite the LogicalPlan itself. Specifically, we should be able to inject
rules after partition pruning (i.e. after the operator optimization batch) and
before any CBO rules that depend on stats. Spark provides the
[injectPreCBORules|https://github.com/apache/spark/blob/6232085227ee2cc4e831996a1ac84c27868a1595/sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala#L304]
API for this; however, it is only available from Spark 3.1.0 onwards.
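For illustration, registering a plan-rewriting rule through that API would look
roughly like the following (the rule body is a no-op placeholder and
`HoodieIndexBasedRewrite` is a hypothetical name):

{code:scala}
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule: would inspect the plan, check which indexes are available
// in the metadata table, and rewrite the relation to use the pruned file list.
case class HoodieIndexBasedRewrite(spark: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan // placeholder
}

class IndexRewriteExtension extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    // Rules registered here run after the operator optimization batch (so
    // partition pruning has already happened) and before the CBO rules that
    // depend on stats.
    extensions.injectPreCBORules(HoodieIndexBasedRewrite)
  }
}
{code}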
The goal of this ticket is to refactor the index hierarchy and create new rules
such that Spark versions < 3.1.0 still go via the old path, while later versions
modify the plan using an appropriate index through a rule injected pre-CBO.
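Since the pre-CBO hook does not exist on older Spark APIs, the new rule would
have to be registered behind a version check (and, in practice, live in a
Spark-version-specific module so that older builds still compile). A minimal
sketch, reusing the hypothetical rule above:

{code:scala}
import org.apache.spark.sql.SparkSessionExtensions

object PreCBORuleInjection {

  // Spark < 3.1.0 has no injectPreCBORules, so only newer versions qualify.
  private def gteqSpark3_1: Boolean = {
    val Array(major, minor) =
      org.apache.spark.SPARK_VERSION.split("\\.").take(2).map(_.toInt)
    major > 3 || (major == 3 && minor >= 1)
  }

  def inject(extensions: SparkSessionExtensions): Unit = {
    if (gteqSpark3_1) {
      // New path: the plan is rewritten by the pre-CBO rule using whichever
      // index is available.
      extensions.injectPreCBORules(HoodieIndexBasedRewrite)
    }
    // Old path: nothing injected; file skipping stays inside
    // HoodieFileIndex#lookupCandidateFilesInMetadataTable as today.
  }
}
{code}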