[ https://issues.apache.org/jira/browse/HUDI-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vova Kolmakov reassigned HUDI-7639:
-----------------------------------

    Assignee:     (was: Vova Kolmakov)

> Refactor HoodieFileIndex so that different indexes can be used via optimizer rules
> ----------------------------------------------------------------------------------
>
>                 Key: HUDI-7639
>                 URL: https://issues.apache.org/jira/browse/HUDI-7639
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Sagar Sumit
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
>
> Currently, `HoodieFileIndex` is responsible for both partition pruning and
> file skipping. All indexes are used in the
> [lookupCandidateFilesInMetadataTable|https://github.com/apache/hudi/blob/b5b14f7d4fa6224a6674b021664b510c6ae8afb9/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala#L333]
> method through if-else branches. This is not only hard to maintain as we add
> more indexes, but it also induces a static hierarchy. Instead, we need more
> flexibility so that we can alter the logical plan based on the availability of
> indexes. For partition pruning in Spark, we already have the
> [HoodiePruneFileSourcePartitions|https://github.com/apache/hudi/blob/b5b14f7d4fa6224a6674b021664b510c6ae8afb9/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodiePruneFileSourcePartitions.scala#L40]
> rule, but it is injected during the operator optimization batch and does not
> modify the resulting LogicalPlan. To be fully extensible, we should be able to
> rewrite the LogicalPlan. We should be able to inject rules after partition
> pruning, i.e., after the operator optimization batch and before any CBO rules
> that depend on stats. Spark provides the
> [injectPreCBORules|https://github.com/apache/spark/blob/6232085227ee2cc4e831996a1ac84c27868a1595/sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala#L304]
> API to do so; however, it is only available from Spark 3.1.0 onwards.
> The goal of this ticket is to refactor the index hierarchy and create new
> rules such that Spark versions < 3.1.0 still go via the old path, while later
> versions can modify the plan using an appropriate index, injected as a
> pre-CBO rule.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
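The version-gated dispatch the ticket describes could be sketched as below. This is a minimal illustration only: `SparkVersionGate` and `supportsPreCBORules` are hypothetical names, not part of the Hudi codebase, and the sketch assumes version strings of the form "major.minor[.patch]".

```scala
// Hypothetical helper (sketch only): decide whether the pre-CBO rule
// injection path is available for a given Spark version string.
object SparkVersionGate {
  // Spark's pre-CBO injection API exists only from 3.1.0 onwards, so
  // older versions must fall back to the existing HoodieFileIndex path.
  def supportsPreCBORules(sparkVersion: String): Boolean = {
    // Take the "major.minor" prefix; patch level does not matter here.
    val parts = sparkVersion.split("\\.").take(2).map(_.toInt)
    val (major, minor) = (parts(0), parts(1))
    major > 3 || (major == 3 && minor >= 1)
  }
}
```

In a session-extension class, such a gate would choose between registering a plan-rewriting pre-CBO rule (Spark >= 3.1.0) and leaving the old `HoodieFileIndex` lookup in place.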