[jira] [Updated] (HUDI-7639) Refactor HoodieFileIndex so that different indexes can be used via optimizer rules
[ https://issues.apache.org/jira/browse/HUDI-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sagar Sumit updated HUDI-7639:
------------------------------
    Status: Patch Available  (was: In Progress)

> Refactor HoodieFileIndex so that different indexes can be used via optimizer rules
> ----------------------------------------------------------------------------------
>
>                 Key: HUDI-7639
>                 URL: https://issues.apache.org/jira/browse/HUDI-7639
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Sagar Sumit
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
> Currently, `HoodieFileIndex` is responsible for both partition pruning and file skipping. All indexes are used in the [lookupCandidateFilesInMetadataTable|https://github.com/apache/hudi/blob/b5b14f7d4fa6224a6674b021664b510c6ae8afb9/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala#L333] method through if-else branches. This is not only hard to maintain as we add more indexes, but also induces a static hierarchy. Instead, we need more flexibility so that we can alter the logical plan based on the availability of indexes.
>
> For partition pruning in Spark, we already have the [HoodiePruneFileSourcePartitions|https://github.com/apache/hudi/blob/b5b14f7d4fa6224a6674b021664b510c6ae8afb9/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodiePruneFileSourcePartitions.scala#L40] rule, but it is injected during the operator optimization batch and does not modify the resulting LogicalPlan. To be fully extensible, we should be able to rewrite the LogicalPlan. We should be able to inject rules after partition pruning (i.e., after the operator optimization batch) and before any CBO rules that depend on stats. Spark provides the [injectPreCBORules|https://github.com/apache/spark/blob/6232085227ee2cc4e831996a1ac84c27868a1595/sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala#L304] API to do so; however, it is only available from Spark 3.1.0 onwards.
>
> The goal of this ticket is to refactor the index hierarchy and create new rules such that Spark versions < 3.1.0 still go via the old path, while later versions can modify the plan using an appropriate index, injected as a pre-CBO rule.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
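The version-gated injection the ticket describes could be sketched roughly as below. This is a minimal illustration only, not Hudi's actual implementation: `HoodieIndexRule` is a hypothetical placeholder rule, and the string-based version check is simplified (in practice such a check would live in version-specific adapter modules, since referencing `injectPreCBORules` directly would not even compile against Spark < 3.1.0).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule that would rewrite scan nodes to use an available index.
class HoodieIndexRule(session: SparkSession) extends Rule[LogicalPlan] {
  // Placeholder body; a real rule would match relations and rewrite the plan.
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

class HoodieSparkSessionExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    // injectPreCBORules only exists in Spark >= 3.1.0, so older Spark
    // versions keep going through the existing HoodieFileIndex path.
    // Note: naive string comparison of versions is used here purely for
    // illustration; real code needs a proper semantic-version check.
    if (org.apache.spark.SPARK_VERSION >= "3.1.0") {
      extensions.injectPreCBORules { session => new HoodieIndexRule(session) }
    }
  }
}
```

Registered via `spark.sql.extensions`, such a rule would run after the operator optimization batch (and thus after partition pruning) but before any cost-based optimization rules that rely on stats, which is exactly the window the ticket targets.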
[jira] [Updated] (HUDI-7639) Refactor HoodieFileIndex so that different indexes can be used via optimizer rules
[ https://issues.apache.org/jira/browse/HUDI-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sagar Sumit updated HUDI-7639:
------------------------------
    Story Points: 5
[jira] [Updated] (HUDI-7639) Refactor HoodieFileIndex so that different indexes can be used via optimizer rules
[ https://issues.apache.org/jira/browse/HUDI-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-7639:
---------------------------------
    Sprint: Sprint 2024-03-25, Sprint 2023-04-26  (was: Sprint 2024-03-25)
[jira] [Updated] (HUDI-7639) Refactor HoodieFileIndex so that different indexes can be used via optimizer rules
[ https://issues.apache.org/jira/browse/HUDI-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7639:
---------------------------------
    Labels: pull-request-available  (was: )

>            Assignee: Vova Kolmakov
[jira] [Updated] (HUDI-7639) Refactor HoodieFileIndex so that different indexes can be used via optimizer rules
[ https://issues.apache.org/jira/browse/HUDI-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vova Kolmakov updated HUDI-7639:
--------------------------------
    Status: In Progress  (was: Open)
[jira] [Updated] (HUDI-7639) Refactor HoodieFileIndex so that different indexes can be used via optimizer rules
[ https://issues.apache.org/jira/browse/HUDI-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-7639:
----------------------------
    Sprint: Sprint 2024-03-25