sadboy commented on PR #7775:
URL: 
https://github.com/apache/arrow-datafusion/pull/7775#issuecomment-1856851836

   > we could / should make "don't error check" type constructors for this kind 
of optimization
   
   As a quick and simple solution, that's what I would recommend, yes.
   
   More fundamentally, I think the contention arises from the de-facto "dual
   use" nature of Datafusion's `LogicalPlan`/`Expr` API, which serves two very
   different use cases:
   1. The DataFrame users, who use DataFusion API to programmtically compose
      their queries. In this kind of scenario, you do want to perform eager
      input validation and fail early.
   2. The "analysis" users, who use `LogicalPlan` as an IR. Datafusion's own
      internal use of `LogicalPlan` would mostly fall into this category as
      well, including the SQL compiler, the analyzer/optimizers, logical plan
      serializers, etc. In this kind of scenario, input sanitation on the plan
      node constructors is wasteful as arguments are already known to be
      well-formed.
   
   As things currently stand, the constructor methods in `LogicalPlanBuilder`
   are largely geared to serve the former use case, but widely shared in the
   code paths of the latter. Not an issue when the input queries are small, but
   would definitely cause scalability issues when processing large queries
   (like we've been experiencing). Having unchecked "dumb" constructors would
   be greatly beneficial for perf here, but only if used consistently through
   the whole codebase (not exactly a trivial concern, as in general simply
   having "unchecked" in the method name would discourage people from using
   them).
   
   Ideally, however, I believe these two use cases are different enough that it
   would be beneficial to actually separate them statically at the type level.
   i.e. have a separate type hierarchy for "DataFramePlan"/"DataFrameExpr",
   parallel to `LogicalPlan`/`Expr`. The former would be exclusively used to
   capture "end user" input through the DataFrame API, while the latter would
   be used exclusively as an internal IR, never directly exposed to the end
   user. There would need to be an explicit conversion between the two, but
   that would mostly be mechanical and trivial (and where the input sanitation
   could take place). The benefit is then you entirely eliminate concerns such
   as "should I call the checked or unchecked version of the constructor" when
   dealing with logical plan/exprs. In addition, you could "fine tune" each
   type hierarchy to better fit its purpose. For example, just off the top of
   my head:
   * Plan types such as `Expr::Wildcard` can be removed, as they don't serve
     any purpose in an IR. The effect is that you can greatly cut down the
     "invalid states" in the analyzer/optimizer pipeline, making it much easier
     to perform tree transformations
   * `Expr::Column` can be changed to index-based rather than name-based (i.e.
     `Column { index: usize }`), so you get guaranteed O(1) column resolution
     
   Anyway, that's just my $0.02 🙂 (and as I just realized, probably way off 
topic for
   this thread too, lol)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to