[jira] [Updated] (HUDI-7639) Refactor HoodieFileIndex so that different indexes can be used via optimizer rules
[ https://issues.apache.org/jira/browse/HUDI-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sagar Sumit updated HUDI-7639:
------------------------------
    Status: Patch Available  (was: In Progress)

> Refactor HoodieFileIndex so that different indexes can be used via optimizer rules
> ----------------------------------------------------------------------------------
>
>                 Key: HUDI-7639
>                 URL: https://issues.apache.org/jira/browse/HUDI-7639
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Sagar Sumit
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
> Currently, `HoodieFileIndex` is responsible for both partition pruning and file skipping. All indexes are used in the [lookupCandidateFilesInMetadataTable|https://github.com/apache/hudi/blob/b5b14f7d4fa6224a6674b021664b510c6ae8afb9/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala#L333] method through if-else branches. This is not only hard to maintain as we add more indexes, but also induces a static hierarchy. Instead, we need more flexibility so that we can alter the logical plan based on the availability of indexes.
>
> For partition pruning in Spark, we already have the [HoodiePruneFileSourcePartitions|https://github.com/apache/hudi/blob/b5b14f7d4fa6224a6674b021664b510c6ae8afb9/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodiePruneFileSourcePartitions.scala#L40] rule, but it is injected during the operator optimization batch and does not modify the resulting LogicalPlan. To be fully extensible, we should be able to rewrite the LogicalPlan. We should be able to inject rules after partition pruning (i.e., after the operator optimization batch) and before any CBO rules that depend on stats. Spark provides the [injectPreCBORules|https://github.com/apache/spark/blob/6232085227ee2cc4e831996a1ac84c27868a1595/sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala#L304] API to do so; however, it is only available from Spark 3.1.0 onwards.
>
> The goal of this ticket is to refactor the index hierarchy and create new rules such that Spark versions < 3.1.0 still go via the old path, while later versions can modify the plan using an appropriate index, injected as a pre-CBO rule.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
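The version-gated injection the ticket describes could be sketched roughly as below. This is a minimal illustration only, not Hudi's actual implementation: `HoodieIndexRule` is a hypothetical placeholder rule, and the string-based version check is simplified (in practice such a check would live in version-specific adapter modules, since referencing `injectPreCBORules` directly would not even compile against Spark < 3.1.0).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule that would rewrite scan nodes to use an available index.
class HoodieIndexRule(session: SparkSession) extends Rule[LogicalPlan] {
  // Placeholder body; a real rule would match relations and rewrite the plan.
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

class HoodieSparkSessionExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    // injectPreCBORules only exists in Spark >= 3.1.0, so older Spark
    // versions keep going through the existing HoodieFileIndex path.
    // Note: naive string comparison of versions is used here purely for
    // illustration; real code needs a proper semantic-version check.
    if (org.apache.spark.SPARK_VERSION >= "3.1.0") {
      extensions.injectPreCBORules { session => new HoodieIndexRule(session) }
    }
  }
}
```

Registered via `spark.sql.extensions`, such a rule would run after the operator optimization batch (and thus after partition pruning) but before any cost-based optimization rules that rely on stats, which is exactly the window the ticket targets.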
[jira] [Updated] (HUDI-7639) Refactor HoodieFileIndex so that different indexes can be used via optimizer rules
[ https://issues.apache.org/jira/browse/HUDI-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sagar Sumit updated HUDI-7639:
------------------------------
    Story Points: 5
[jira] [Updated] (HUDI-7639) Refactor HoodieFileIndex so that different indexes can be used via optimizer rules
[ https://issues.apache.org/jira/browse/HUDI-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-7639:
---------------------------------
    Sprint: Sprint 2024-03-25, Sprint 2023-04-26  (was: Sprint 2024-03-25)
[jira] [Updated] (HUDI-7639) Refactor HoodieFileIndex so that different indexes can be used via optimizer rules
[ https://issues.apache.org/jira/browse/HUDI-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7639:
---------------------------------
    Labels: pull-request-available  (was: )

>            Assignee: Vova Kolmakov
[jira] [Updated] (HUDI-7639) Refactor HoodieFileIndex so that different indexes can be used via optimizer rules
[ https://issues.apache.org/jira/browse/HUDI-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vova Kolmakov updated HUDI-7639:
--------------------------------
    Status: In Progress  (was: Open)
[jira] [Updated] (HUDI-7639) Refactor HoodieFileIndex so that different indexes can be used via optimizer rules
[ https://issues.apache.org/jira/browse/HUDI-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-7639:
----------------------------
    Sprint: Sprint 2024-03-25