Hi Andrew,

I carefully went through the document and the PR. Thank you for this!
I believe that the improvements in the PR alone are a major benefit:
it supports custom logical plans out of the box, which opens a lot of
possibilities for users. I also like the idea of migrating from an
enum to a `LogicalPlanNode` trait for the advantages that you
mentioned.

I am undecided whether we should go for it, because we may have to
trade off node-specific optimizations. I currently can't tell whether
that trade-off really exists, or whether we can add trait methods
(such as `prevent_predicate_push_down_columns` in your PR) that allow
us to achieve full optimizations. I would need to implement this for
some of our nodes to check whether we could land on a good end result,
which of course entails the risk that we scrap the work if we can't
arrive at one. Maybe @Andy Grove <andygrov...@gmail.com> or others can
make a better judgment call here.

In summary, I very much agree with the content of the proposal, but I
currently do not have sufficient information to evaluate the
trade-offs. One concrete path is to try to migrate a difficult node,
e.g. Aggregate, and see how hard it is to make our current tests pass.

Best,
Jorge

On Fri, Aug 21, 2020 at 11:37 PM Andrew Lamb <al...@influxdata.com>
wrote:

> I would like to propose and request comments from the DataFusion
> community on adding user defined LogicalPlanNodes.
>
> A detailed proposal is in this google doc
> <https://docs.google.com/document/d/1IHCGkCuUvnE9BavkykPULn6Ugxgqc1JShT4nz1vMi7g/edit#>
> (comments enabled, copy/pasted below for your convenience).
>
> Here is a PR showing how the approach could work:
> https://github.com/apache/arrow/pull/8020
> <https://github.com/apache/arrow/pull/8020#pullrequestreview-472829313>
>
> Thanks!
> Andrew
>
>
> Proposal for User Defined PlanNode / Operator API
>
> August 21, 2020
>
> Andrew Lamb
>
> This document is a high level proposal / request for comments from
> the DataFusion community on adding user defined LogicalPlanNodes.
>
> Problem Statement / Rationale
>
> We are contemplating building a new query engine for a time series
> related product using DataFusion. To do so, we will likely need
> domain specific optimizations which are unlikely to be appropriate
> for a general purpose engine such as DataFusion because of their
> specialization.
>
> Examples of the kinds of optimizations we are thinking of:
>
> 1. Push down (certain) filter predicates and aggregates into the
>    actual scan over specialized storage structures.
>
> 2. Specialized time series specific aggregates that rely on the
>    order of the input rows, such as first/last.
>
> Proposed Solution
>
> I propose changing LogicalPlan
> <https://github.com/apache/arrow/blob/master/rust/datafusion/src/logicalplan.rs#L773>
> nodes from an enum of structs, which must be defined in the
> DataFusion source code, to a tree of `dyn LogicalPlanNode` trait
> objects. Here is a PR that demonstrates how such an approach could
> work: https://github.com/apache/arrow/pull/8020
>
> The primary benefit of such a design over the existing enum of
> structs is that users of the DataFusion library can write their own
> user defined LogicalPlan nodes and still take advantage of general
> purpose logic such as predicate push down.
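> For a rough sense of the shape of such an API, here is a minimal
> sketch of what the trait could look like. This is illustrative only:
> the method names and signatures below are assumptions made for this
> write-up, not necessarily the exact API in the PR.
>
>     use std::any::Any;
>     use std::fmt;
>     use std::sync::Arc;
>
>     use arrow::datatypes::SchemaRef;
>
>     /// Illustrative sketch of a user-definable logical plan node.
>     /// Method names are assumptions, not necessarily the PR's API.
>     pub trait LogicalPlanNode: fmt::Debug {
>         /// Child nodes, so that generic optimizer passes can walk
>         /// the plan tree without knowing the concrete node type.
>         fn inputs(&self) -> Vec<Arc<dyn LogicalPlanNode>>;
>
>         /// Schema of the rows this node produces.
>         fn schema(&self) -> SchemaRef;
>
>         /// Columns through which a generic pass must not push
>         /// predicates; a hook like this is one way node-specific
>         /// behavior can be exposed to generic algorithms.
>         fn prevent_predicate_push_down_columns(&self) -> Vec<String> {
>             vec![] // by default, no restrictions
>         }
>
>         /// Support downcasting back to the concrete node type.
>         fn as_any(&self) -> &dyn Any;
>     }
>
> A user crate could then implement this trait for its own nodes, while
> generic passes such as predicate push down operate only through the
> trait.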
> A downside of this design is that it will isolate the logic for the
> treatment of each type of LogicalPlanNode into its own module for
> that plan node. This means that algorithms over LogicalPlan nodes
> (e.g. predicate pushdown) will no longer have any node type specific
> logic in them, which could make them harder to reason about.
>
> Prior Work:
>
> I am not familiar enough with the Spark codebase to really understand
> how/if Spark allows this, but I think the high level idea is that the
> catalyst
> <https://github.com/apache/spark/tree/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst>
> library provides a similar interface called LogicalPlan
> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L29>
> that defines the operators that are available in Spark SQL.
>
> Alternates considered:
>
> One alternate design we could pursue is to extend the existing
> TableProvider trait to support more sophistication (e.g. predicate
> pushdown). However, adding specialized aggregation operations to the
> TableProvider trait seems like it would be confusing to most users,
> who do not have specialized needs such as pushing partial
> aggregations into the raw scans.
>
> Another alternate design we could use on our project is to maintain a
> fork of the DataFusion code base and simply add our own plan nodes
> directly to our fork. We would prefer to avoid this, as it would be
> more expensive to maintain, and we think the user defined API would
> likely be valuable to others in the community.
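> As a closing illustration of the proposed user defined API, here is a
> sketch of an order-dependent time series aggregate node, built on the
> illustrative trait sketched above. The node type and its fields are
> invented for this example; the point is that such a node can live
> entirely outside the DataFusion source tree.
>
>     /// Hypothetical user defined node: a time series aggregate
>     /// (e.g. first/last) that relies on the order of its input rows.
>     #[derive(Debug)]
>     pub struct OrderedAggregateNode {
>         pub input: Arc<dyn LogicalPlanNode>,
>         pub schema: SchemaRef,
>         /// The aggregate depends on row order within this column,
>         /// so predicates must not be pushed through it.
>         pub time_column: String,
>     }
>
>     impl LogicalPlanNode for OrderedAggregateNode {
>         fn inputs(&self) -> Vec<Arc<dyn LogicalPlanNode>> {
>             vec![self.input.clone()]
>         }
>
>         fn schema(&self) -> SchemaRef {
>             self.schema.clone()
>         }
>
>         fn prevent_predicate_push_down_columns(&self) -> Vec<String> {
>             vec![self.time_column.clone()]
>         }
>
>         fn as_any(&self) -> &dyn Any {
>             self
>         }
>     }
>
> A general purpose predicate push down pass would consult
> `prevent_predicate_push_down_columns` here without any knowledge of
> this particular node type.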