zhuqi-lucas opened a new issue, #22732:
URL: https://github.com/apache/datafusion/issues/22732

   ## Describe the bug / opportunity
   
   `LogicalPlan` is 320 bytes on the stack today, but the typical 
query-execution path never produces the variants that drive that size. The 
`Ddl(DdlStatement)` variant is the offender: it carries `CreateExternalTable` 
(312 bytes) and `CreateFunction` (288 bytes), and the enum-size rule 
(`max(variant) + tag`) forces the whole `LogicalPlan` enum to the same width on 
every code path — including SELECT queries that will never instantiate a DDL 
node.
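
The size rule above can be seen in isolation with a toy enum. This is a minimal sketch: `SmallNode`, `LargeNode`, and `Plan` are hypothetical stand-ins, not DataFusion types, and the exact enum size is decided by the compiler's layout rather than guaranteed by the language.

```rust
use std::mem::size_of;

// Hypothetical stand-ins, not DataFusion types. The arrays only give each
// payload a known size: 5 * 8 = 40 bytes and 39 * 8 = 312 bytes.
#[allow(dead_code)]
struct SmallNode([u64; 5]);

#[allow(dead_code)]
struct LargeNode([u64; 39]);

// Every value of this enum reserves room for the largest variant plus a
// discriminant, even when it holds `Small`.
#[allow(dead_code)]
enum Plan {
    Small(SmallNode),
    Large(LargeNode),
}

fn main() {
    println!("SmallNode: {} bytes", size_of::<SmallNode>());
    println!("LargeNode: {} bytes", size_of::<LargeNode>());
    println!("Plan:      {} bytes", size_of::<Plan>());
}
```

A `Plan::Small` value occupies as many bytes as a `Plan::Large` one, which is the same effect the DDL variants have on `LogicalPlan`.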
   
   This shows up directly on the planning hot path. Profiling `sql_planner` 
(samply, `logical_plan_tpch_all`) on macOS aarch64:
   
   ```
   55%  in sql_planner binary (DataFusion + Rust stdlib)
   31%  libsystem_malloc.dylib  (malloc / free / realloc)
   13%  libsystem_platform.dylib (memcpy / memmove)
    1%  other (kernel, dyld, pthread)
   ```
   
   A non-trivial share of the 13% memcpy/memmove time is `LogicalPlan` moves: 
every `std::mem::take` in the optimizer's in-place rewriters, every owned-API 
`LogicalPlan::map_*`, every `Arc<LogicalPlan>` write currently shuffles 320 
bytes, even when the loaded variant is something small like `Projection` (40 
bytes) or `Filter` (128 bytes).
   
   ### Per-variant sizes
   
   ```
   === LogicalPlan enum total ===
      320 bytes  LogicalPlan
   === Per-variant inner struct ===
       40 bytes  Projection
      128 bytes  Filter
       40 bytes  Window
       64 bytes  Aggregate
       48 bytes  Sort
      176 bytes  Join
       40 bytes  Repartition
       32 bytes  Union
       56 bytes  Subquery
       72 bytes  SubqueryAlias
       24 bytes  Limit
       88 bytes  Distinct
       16 bytes  Extension
       56 bytes  RecursiveQuery
       48 bytes  Analyze
       48 bytes  Explain
      168 bytes  TableScan
       32 bytes  Values
      144 bytes  Unnest
       96 bytes  DmlStatement
      120 bytes  CreateMemoryTable
       96 bytes  CreateView
       88 bytes  DistinctOn
       56 bytes  Statement
      320 bytes  DdlStatement       <-- forces LogicalPlan to 320
       16 bytes  EmptyRelation
       16 bytes  DescribeTable
   
   === Inside DdlStatement ===
      312 bytes  CreateExternalTable  <-- dominates DdlStatement
      288 bytes  CreateFunction       <-- second-largest
      144 bytes  CreateIndex
       72 bytes  DropTable / DropView
       48 bytes  DropCatalogSchema
       40 bytes  CreateCatalog / CreateCatalogSchema / DropFunction
   ```
   
   If `CreateExternalTable` and `CreateFunction` are `Box`ed inside 
`DdlStatement`, the max DDL variant drops to `CreateIndex` at 144 bytes, the 
max `LogicalPlan` variant becomes `Join` at 176, and `LogicalPlan` shrinks to 
**176 bytes (–45%)** — the enum discriminant fits inside `Join`'s alignment 
padding, so `LogicalPlan` ends up the same width as `Join` itself. The cost is 
one heap allocation per DDL plan, which is negligible because DDL plans are not 
on the per-query hot path.
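
A before/after sketch of the boxing idea, using hypothetical payload types sized after the figures above (`BigDdl`, `MidDdl`, and `JoinLike` are not DataFusion types). Because the toy `JoinLike` is a plain array with no spare bits for a discriminant, the toy `PlanAfter` may come out slightly wider than its largest variant; the real figure has to be measured on `LogicalPlan` itself.

```rust
use std::mem::size_of;

// Hypothetical payloads sized after the figures in the issue.
#[allow(dead_code)]
struct BigDdl([u64; 39]); // 312 bytes, stands in for CreateExternalTable
#[allow(dead_code)]
struct MidDdl([u64; 18]); // 144 bytes, stands in for CreateIndex
#[allow(dead_code)]
struct JoinLike([u64; 22]); // 176 bytes, stands in for Join

// Before: the large DDL payload is stored inline.
#[allow(dead_code)]
enum DdlInline {
    Big(BigDdl),
    Mid(MidDdl),
}

// After: the large payload lives on the heap; the variant holds a pointer.
#[allow(dead_code)]
enum DdlBoxed {
    Big(Box<BigDdl>),
    Mid(MidDdl),
}

#[allow(dead_code)]
enum PlanBefore {
    Join(JoinLike),
    Ddl(DdlInline),
}

#[allow(dead_code)]
enum PlanAfter {
    Join(JoinLike),
    Ddl(DdlBoxed),
}

fn main() {
    println!("DdlInline:  {} bytes", size_of::<DdlInline>());
    println!("DdlBoxed:   {} bytes", size_of::<DdlBoxed>());
    println!("PlanBefore: {} bytes", size_of::<PlanBefore>());
    println!("PlanAfter:  {} bytes", size_of::<PlanAfter>());
}
```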
   
   ## To Reproduce
   
   ```rust
   // Run from a crate that depends on datafusion-expr.
   use std::mem::size_of;
   use datafusion_expr::{CreateExternalTable, CreateFunction, DdlStatement, LogicalPlan};

   fn main() {
       println!("{}", size_of::<LogicalPlan>());         // 320
       println!("{}", size_of::<DdlStatement>());        // 320
       println!("{}", size_of::<CreateExternalTable>()); // 312
       println!("{}", size_of::<CreateFunction>());      // 288
   }
   ```
   
   ## Expected behavior
   
   `LogicalPlan` should not be sized by variants that never appear on the query 
path. Moving the two outsized DDL variants behind a `Box` leaves `Join` 
(176 bytes) as the variant that sets the size of `LogicalPlan`, and that size 
is the cost every plan node pays on every query.
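
One possible way to keep the size from creeping back up is a compile-time assertion. The sketch below uses a hypothetical `Plan` enum and a hypothetical `MAX_PLAN_BYTES` budget set to the 176-byte target; whether DataFusion would want such a guard, and where it would live, is an assumption.

```rust
use std::mem::size_of;

// Hypothetical enum standing in for the plan type being guarded.
#[allow(dead_code)]
enum Plan {
    Small([u64; 5]),
    Boxed(Box<[u64; 39]>), // large payload kept behind a pointer
}

// Hypothetical size budget, taken from the 176-byte target in the issue.
const MAX_PLAN_BYTES: usize = 176;

// Evaluated at compile time: the build fails if `Plan` outgrows the budget.
const _: () = assert!(size_of::<Plan>() <= MAX_PLAN_BYTES);

fn main() {
    println!("Plan: {} bytes (budget {})", size_of::<Plan>(), MAX_PLAN_BYTES);
}
```

Adding an inline 312-byte variant to this `Plan` would turn the assertion into a compile error, which surfaces the regression before any benchmark has to.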
   
   ## Additional context
   
   Local `cargo bench -p datafusion --bench sql_planner --quick` on macOS 
aarch64, comparing main vs. boxed DDL variants:
   
   | bench | main | boxed | delta |
   |---|---|---|---|
   | `optimizer_tpch_all` | 8.61 ms | 8.18 ms | **–5.0%** |
   | `optimizer_tpcds_all` | 168.0 ms | 163.5 ms | **–2.7%** |
   
   Smaller benches (sub-200 µs) are within `--quick` noise.
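
For reference, the delta column can be recomputed from the two timing columns. The `percent_delta` helper below is hypothetical and not part of the bench harness.

```rust
// Relative change from `before` to `after`, in percent (negative = faster).
fn percent_delta(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

fn main() {
    // Timings in milliseconds, copied from the table above.
    for (name, main_ms, boxed_ms) in [
        ("optimizer_tpch_all", 8.61, 8.18),
        ("optimizer_tpcds_all", 168.0, 163.5),
    ] {
        println!("{name}: {:+.1}%", percent_delta(main_ms, boxed_ms));
    }
}
```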
   
   CI bench on the GKE aarch64 runner should give a tighter signal; willing to 
open a draft PR so a maintainer can trigger it.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

