Hi Andrew,

I carefully went through the document and the PR. Thank you for this!
I believe that the improvements in the PR alone are a major benefit:
it supports custom logical plans out of the box, which opens a lot of
possibilities for users. I also like the idea of migrating from an
enum to a `LogicalPlanNode` trait for the advantages that you
mentioned.

I am undecided whether we should go for it, because we may have to
trade off node-specific optimizations. I currently can't tell whether
that trade-off really exists, or whether we can add trait methods
(such as `prevent_predicate_push_down_columns` in your PR) that allow
us to achieve full optimizations. I would need to implement this for
some of our nodes to check whether we could land on a good end result,
which of course entails the risk that we scrap the work if we can't
arrive at one. Maybe @Andy Grove <andygrov...@gmail.com> or others can
make a better judgment call here.

In summary, I very much agree with the content of the proposal, but I
currently do not have sufficient information to evaluate the
trade-offs. One concrete path is to try to migrate a difficult node,
e.g. Aggregate, and see how hard it is to make our current tests pass.

Best,
Jorge

On Fri, Aug 21, 2020 at 11:37 PM Andrew Lamb <al...@influxdata.com>
wrote:

> I would like to propose and request comments from the DataFusion
> community on adding user defined LogicalPlanNodes.
>
> A detailed proposal is in this google doc
> <https://docs.google.com/document/d/1IHCGkCuUvnE9BavkykPULn6Ugxgqc1JShT4nz1vMi7g/edit#>
> (comments enabled, copy/pasted below for your convenience).
>
> Here is a PR showing how the approach could work:
> https://github.com/apache/arrow/pull/8020
> <https://github.com/apache/arrow/pull/8020#pullrequestreview-472829313>
>
> Thanks!
> Andrew
>
>
> Proposal for User Defined PlanNode / Operator API
>
> August 21, 2020
>
> Andrew Lamb
>
> This document is a high level proposal / request for comments from
> the DataFusion community on adding user defined LogicalPlanNodes.
>
> Problem Statement / Rationale
>
> We are contemplating building a new query engine for a time series
> related product using DataFusion. To do so, we will likely need
> domain specific optimizations which are unlikely to be appropriate
> for a general purpose engine such as DataFusion because of their
> specialization.
>
> Examples of the kinds of optimizations we are thinking of:
>
> 1. Push down (certain) filter predicates and aggregates into the
>    actual scan over specialized storage structures.
>
> 2. Specialized time series specific aggregates that rely on the
>    order of the input rows, such as first/last.
>
> Proposed Solution
>
> I propose changing LogicalPlan
> <https://github.com/apache/arrow/blob/master/rust/datafusion/src/logicalplan.rs#L773>
> nodes from an enum of structs, which must be defined in the
> DataFusion source code, to a tree of `dyn LogicalPlanNode` trait
> objects. Here is a PR that demonstrates how such an approach could
> work: https://github.com/apache/arrow/pull/8020
>
> The primary benefit of such a design over the existing enum of
> structs is that users of the DataFusion library can write their own
> user defined LogicalPlan nodes and still take advantage of general
> purpose logic such as predicate push down.
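> For a rough sense of the shape of such an API, here is a minimal
> sketch of what the trait could look like. This is illustrative only:
> the method names and signatures below are assumptions made for this
> write-up, not necessarily the exact API in the PR.
>
>     use std::any::Any;
>     use std::fmt;
>     use std::sync::Arc;
>
>     use arrow::datatypes::SchemaRef;
>
>     /// Illustrative sketch of a user-definable logical plan node.
>     /// Method names are assumptions, not necessarily the PR's API.
>     pub trait LogicalPlanNode: fmt::Debug {
>         /// Child nodes, so that generic optimizer passes can walk
>         /// the plan tree without knowing the concrete node type.
>         fn inputs(&self) -> Vec<Arc<dyn LogicalPlanNode>>;
>
>         /// Schema of the rows this node produces.
>         fn schema(&self) -> SchemaRef;
>
>         /// Columns through which a generic pass must not push
>         /// predicates; a hook like this is one way node-specific
>         /// behavior can be exposed to generic algorithms.
>         fn prevent_predicate_push_down_columns(&self) -> Vec<String> {
>             vec![] // by default, no restrictions
>         }
>
>         /// Support downcasting back to the concrete node type.
>         fn as_any(&self) -> &dyn Any;
>     }
>
> A user crate could then implement this trait for its own nodes, while
> generic passes such as predicate push down operate only through the
> trait.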
> A downside of this design is that it will isolate the logic for the
> treatment of each type of LogicalPlanNode into its own module for
> that plan node. This means that algorithms over LogicalPlan nodes
> (e.g. predicate pushdown) will no longer have any node type specific
> logic in them, which could make them harder to reason about.
>
> Prior Work:
>
> I am not familiar enough with the Spark codebase to really understand
> how/if Spark allows this, but I think the high level idea is that the
> catalyst
> <https://github.com/apache/spark/tree/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst>
> library provides a similar interface called LogicalPlan
> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L29>
> that defines the operators that are available in Spark SQL.
>
> Alternates considered:
>
> One alternate design we could pursue is to extend the existing
> TableProvider trait to support more sophistication (e.g. predicate
> pushdown). However, adding specialized aggregation operations to the
> TableProvider trait seems like it would be confusing to most users,
> who do not have specialized needs such as pushing partial
> aggregations into the raw scans.
>
> Another alternate design we could use on our project is to maintain a
> fork of the DataFusion code base and simply add our own plan nodes
> directly to our fork. We would prefer to avoid this, as it would be
> more expensive to maintain, and we think the user defined API would
> likely be valuable to others in the community.
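> As a closing illustration of the proposed user defined API, here is a
> sketch of an order-dependent time series aggregate node, built on the
> illustrative trait sketched above. The node type and its fields are
> invented for this example; the point is that such a node can live
> entirely outside the DataFusion source tree.
>
>     /// Hypothetical user defined node: a time series aggregate
>     /// (e.g. first/last) that relies on the order of its input rows.
>     #[derive(Debug)]
>     pub struct OrderedAggregateNode {
>         pub input: Arc<dyn LogicalPlanNode>,
>         pub schema: SchemaRef,
>         /// The aggregate depends on row order within this column,
>         /// so predicates must not be pushed through it.
>         pub time_column: String,
>     }
>
>     impl LogicalPlanNode for OrderedAggregateNode {
>         fn inputs(&self) -> Vec<Arc<dyn LogicalPlanNode>> {
>             vec![self.input.clone()]
>         }
>
>         fn schema(&self) -> SchemaRef {
>             self.schema.clone()
>         }
>
>         fn prevent_predicate_push_down_columns(&self) -> Vec<String> {
>             vec![self.time_column.clone()]
>         }
>
>         fn as_any(&self) -> &dyn Any {
>             self
>         }
>     }
>
> A general purpose predicate push down pass would consult
> `prevent_predicate_push_down_columns` here without any knowledge of
> this particular node type.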