sadboy commented on PR #7775:
URL:
https://github.com/apache/arrow-datafusion/pull/7775#issuecomment-1856851836
> we could / should make "don't error check" type constructors for this kind
of optimization
As a quick and simple solution, that's what I would recommend, yes.
More fundamentally, I think the contention arises from the de-facto "dual
use" nature of Datafusion's `LogicalPlan`/`Expr` API, which serves two very
different use cases:
1. The DataFrame users, who use DataFusion API to programmtically compose
their queries. In this kind of scenario, you do want to perform eager
input validation and fail early.
2. The "analysis" users, who use `LogicalPlan` as an IR. Datafusion's own
internal use of `LogicalPlan` would mostly fall into this category as
well, including the SQL compiler, the analyzer/optimizers, logical plan
serializers, etc. In this kind of scenario, input sanitation on the plan
node constructors is wasteful as arguments are already known to be
well-formed.
As things currently stand, the constructor methods in `LogicalPlanBuilder`
are largely geared to serve the former use case, but widely shared in the
code paths of the latter. Not an issue when the input queries are small, but
would definitely cause scalability issues when processing large queries
(like we've been experiencing). Having unchecked "dumb" constructors would
be greatly beneficial for perf here, but only if used consistently through
the whole codebase (not exactly a trivial concern, as in general simply
having "unchecked" in the method name would discourage people from using
them).
Ideally, however, I believe these two use cases are different enough that it
would be beneficial to actually separate them statically at the type level.
i.e. have a separate type hierarchy for "DataFramePlan"/"DataFrameExpr",
parallel to `LogicalPlan`/`Expr`. The former would be exclusively used to
capture "end user" input through the DataFrame API, while the latter would
be used exclusively as an internal IR, never directly exposed to the end
user. There would need to be an explicit conversion between the two, but
that would mostly be mechanical and trivial (and where the input sanitation
could take place). The benefit is then you entirely eliminate concerns such
as "should I call the checked or unchecked version of the constructor" when
dealing with logical plan/exprs. In addition, you could "fine tune" each
type hierarchy to better fit its purpose. For example, just off the top of
my head:
* Plan types such as `Expr::Wildcard` can be removed, as they don't serve
any purpose in an IR. The effect is that you can greatly cut down the
"invalid states" in the analyzer/optimizer pipeline, making it much easier
to perform tree transformations
* `Expr::Column` can be changed to index-based rather than name-based (i.e.
`Column { index: usize }`), so you get guaranteed O(1) column resolution
Anyway, that's just my $0.02 🙂 (and as I just realized, probably way off
topic for
this thread too, lol)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]