This is an automated email from the ASF dual-hosted git repository.

github-merge-queue[bot] pushed a commit to branch 
gh-readonly-queue/main/pr-22733-e1d8d463b51e67e777b3ef744e80fa75593b3e5b
in repository https://gitbox.apache.org/repos/asf/datafusion.git

commit 2ec0ab50e39970592c8e7d25b330c92c68c3fca8
Author: Qi Zhu <[email protected]>
AuthorDate: Fri Jun 5 15:41:04 2026 +0800

    perf(logical-plan): box CreateExternalTable / CreateFunction in 
DdlStatement (-45% LogicalPlan size) (#22733)
    
    ## Which issue does this PR close?
    
    Closes #22732.
    
    ## Rationale for this change
    
    `LogicalPlan` is 320 bytes today, sized by the `Ddl(DdlStatement)`
    variant — which is itself 320 bytes because it carries
    `CreateExternalTable` (312 bytes) and `CreateFunction` (288 bytes).
    Every non-DDL variant is at most 176 bytes (`Join`). Every SELECT query
    pays the full 320-byte payload on every `mem::take` / `mem::swap` /
    `Arc<LogicalPlan>` write during planning, even though it never
    instantiates a DDL node.
    
    Profiling `sql_planner` (samply, macOS aarch64) showed ~13% of CPU in
    `libsystem_platform.dylib` (memcpy / memmove). Shrinking the most
    frequently-moved type directly attacks that pool.
    
    ## What changes are included in this PR?
    
    Box the two oversized DDL variants:
    
    ```rust
    pub enum DdlStatement {
        CreateExternalTable(Box<CreateExternalTable>),
        // …
        CreateFunction(Box<CreateFunction>),
        // …
    }
    ```
    
    After this change:
    
    - `DdlStatement` drops from **320 → ~152 bytes** (max variant is now
    `CreateIndex` at 144).
    - `LogicalPlan` drops from **320 → 176 bytes (–45%)** — the enum
    discriminant fits inside `Join`'s alignment padding, so the enum is
    exactly the same width as `Join`.
    
    DDL plan construction takes one extra `Box::new(...)` allocation per DDL
    statement — negligible because DDL plans are one-shot and not on the
    per-query hot path.
    
    The diff is mechanical: construction sites add `Box::new(...)`, the few
    field-destructuring patterns are converted to a `let CreateExternalTable
    { … } = ce.as_ref();` shape, and the FFI/proto crates need one `*cmd` /
    `Box::new(cmd)` adjustment at the type boundaries.
    
    A new `test_size_of_logical_plan` unit test pins
    `size_of::<LogicalPlan>() == 176` and asserts `DdlStatement` stays
    smaller than `Join`, so future variant growth that would re-balloon the
    enum trips the test rather than silently regressing the planning hot
    path. (Same shape as the existing `test_size_of_expr` in `expr.rs`.)
    
    ## Are these changes tested?
    
    - 711 optimizer + 217 expr + 86 sql + 7 proto unit tests pass.
    - SLT `create_external_table` / `create_function` / `ddl` pass.
    - `cargo clippy --workspace --all-targets -- -D warnings` clean.
    
    
    ## Are there any user-facing changes?
    
    Yes (semantic API only). Code that pattern-matches
    `DdlStatement::CreateExternalTable(CreateExternalTable { … })` must
    change to bind the box and deref. Code that constructs these variants
    must wrap with `Box::new(...)`. No behavioral change.
---
 datafusion/core/src/execution/context/mod.rs       |   2 +-
 datafusion/expr/src/logical_plan/ddl.rs            |  28 +++---
 datafusion/expr/src/logical_plan/plan.rs           |  39 ++++++++
 datafusion/ffi/src/table_provider_factory.rs       |   4 +-
 datafusion/proto/src/logical_plan/mod.rs           |  51 +++++-----
 datafusion/sql/src/statement.rs                    |  30 +++---
 docs/source/library-user-guide/upgrading/55.0.0.md | 106 +++++++++++++++++++++
 7 files changed, 205 insertions(+), 55 deletions(-)

diff --git a/datafusion/core/src/execution/context/mod.rs 
b/datafusion/core/src/execution/context/mod.rs
index b2ad5c7d7a..189206a711 100644
--- a/datafusion/core/src/execution/context/mod.rs
+++ b/datafusion/core/src/execution/context/mod.rs
@@ -714,7 +714,7 @@ impl SessionContext {
                         Box::pin(self.drop_schema(cmd)).await
                     }
                     DdlStatement::CreateFunction(cmd) => {
-                        Box::pin(self.create_function(cmd)).await
+                        Box::pin(self.create_function(*cmd)).await
                     }
                     DdlStatement::DropFunction(cmd) => {
                         Box::pin(self.drop_function(cmd)).await
diff --git a/datafusion/expr/src/logical_plan/ddl.rs 
b/datafusion/expr/src/logical_plan/ddl.rs
index 5779fb0c4e..1990a31edb 100644
--- a/datafusion/expr/src/logical_plan/ddl.rs
+++ b/datafusion/expr/src/logical_plan/ddl.rs
@@ -38,8 +38,10 @@ use sqlparser::ast::Ident;
 /// Various types of DDL  (CREATE / DROP) catalog manipulation
 #[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Hash)]
 pub enum DdlStatement {
-    /// Creates an external table.
-    CreateExternalTable(CreateExternalTable),
+    /// Creates an external table. Boxed to keep `LogicalPlan` enum size down
+    /// — `CreateExternalTable` is ~312 bytes, dwarfing every other variant
+    /// in the plan tree and forcing the whole enum to that width.
+    CreateExternalTable(Box<CreateExternalTable>),
     /// Creates an in memory table.
     CreateMemoryTable(CreateMemoryTable),
     /// Creates a new view.
@@ -56,8 +58,9 @@ pub enum DdlStatement {
     DropView(DropView),
     /// Drops a catalog schema
     DropCatalogSchema(DropCatalogSchema),
-    /// Create function statement
-    CreateFunction(CreateFunction),
+    /// Create function statement. Boxed for the same reason as
+    /// [`Self::CreateExternalTable`] (~288 bytes).
+    CreateFunction(Box<CreateFunction>),
     /// Drop function statement
     DropFunction(DropFunction),
 }
@@ -66,9 +69,7 @@ impl DdlStatement {
     /// Get a reference to the logical plan's schema
     pub fn schema(&self) -> &DFSchemaRef {
         match self {
-            DdlStatement::CreateExternalTable(CreateExternalTable { schema, .. 
}) => {
-                schema
-            }
+            DdlStatement::CreateExternalTable(ce) => &ce.schema,
             DdlStatement::CreateMemoryTable(CreateMemoryTable { input, .. })
             | DdlStatement::CreateView(CreateView { input, .. }) => 
input.schema(),
             DdlStatement::CreateCatalogSchema(CreateCatalogSchema { schema, .. 
}) => {
@@ -79,7 +80,7 @@ impl DdlStatement {
             DdlStatement::DropTable(DropTable { schema, .. }) => schema,
             DdlStatement::DropView(DropView { schema, .. }) => schema,
             DdlStatement::DropCatalogSchema(DropCatalogSchema { schema, .. }) 
=> schema,
-            DdlStatement::CreateFunction(CreateFunction { schema, .. }) => 
schema,
+            DdlStatement::CreateFunction(cf) => &cf.schema,
             DdlStatement::DropFunction(DropFunction { schema, .. }) => schema,
         }
     }
@@ -131,11 +132,9 @@ impl DdlStatement {
         impl Display for Wrapper<'_> {
             fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
                 match self.0 {
-                    DdlStatement::CreateExternalTable(CreateExternalTable {
-                        name,
-                        constraints,
-                        ..
-                    }) => {
+                    DdlStatement::CreateExternalTable(ce) => {
+                        let name = &ce.name;
+                        let constraints = &ce.constraints;
                         if constraints.is_empty() {
                             write!(f, "CreateExternalTable: {name:?}")
                         } else {
@@ -191,7 +190,8 @@ impl DdlStatement {
                             "DropCatalogSchema: {name:?} if not 
exist:={if_exists} cascade:={cascade}"
                         )
                     }
-                    DdlStatement::CreateFunction(CreateFunction { name, .. }) 
=> {
+                    DdlStatement::CreateFunction(cf) => {
+                        let name = &cf.name;
                         write!(f, "CreateFunction: name {name:?}")
                     }
                     DdlStatement::DropFunction(DropFunction { name, .. }) => {
diff --git a/datafusion/expr/src/logical_plan/plan.rs 
b/datafusion/expr/src/logical_plan/plan.rs
index 1bfecd06c2..b884395386 100644
--- a/datafusion/expr/src/logical_plan/plan.rs
+++ b/datafusion/expr/src/logical_plan/plan.rs
@@ -4740,6 +4740,45 @@ mod tests {
     use insta::{assert_debug_snapshot, assert_snapshot};
     use std::hash::DefaultHasher;
 
+    /// `LogicalPlan` is moved/swapped on every step of the planning hot path
+    /// (every `mem::take` in an in-place rewriter, every `Arc<LogicalPlan>`
+    /// write, every owned `map_*` traversal). Its size is set by the largest
+    /// variant, so an oversized variant balloons cost for every other variant.
+    ///
+    /// Today the size-setter should be `Join` (~176 bytes); `DdlStatement` is
+    /// boxed precisely so it does not dominate. If you grow a variant, please
+    /// box the new large fields rather than letting this number creep up —
+    /// see the analogous `test_size_of_expr` in `expr.rs`.
+    #[test]
+    fn test_size_of_logical_plan() {
+        // `LogicalPlan` enum on aarch64 / x86_64. Today this matches
+        // `Join`'s 176 bytes (the enum discriminant fits in `Join`'s
+        // alignment padding); if `Join` grows or another variant overtakes
+        // it, this number will move with the new size-setter.
+        assert_eq!(size_of::<LogicalPlan>(), 176);
+        // `DdlStatement` is `Ddl(DdlStatement)`'s payload; keep it below the
+        // `Join` ceiling so it never re-becomes the size-setter.
+        assert!(
+            size_of::<DdlStatement>() < size_of::<Join>(),
+            "DdlStatement ({} bytes) should stay smaller than Join ({} bytes); 
\
+             box the new large variant rather than letting it dominate 
`LogicalPlan`.",
+            size_of::<DdlStatement>(),
+            size_of::<Join>(),
+        );
+        // Sanity check the two boxed variants stay boxed (so the payload
+        // sits on the heap, not in the enum).
+        assert_eq!(
+            size_of::<Box<crate::CreateExternalTable>>(),
+            8,
+            "CreateExternalTable should be Box'd inside DdlStatement"
+        );
+        assert_eq!(
+            size_of::<Box<crate::CreateFunction>>(),
+            8,
+            "CreateFunction should be Box'd inside DdlStatement"
+        );
+    }
+
     fn employee_schema() -> Schema {
         Schema::new(vec![
             Field::new("id", DataType::Int32, false),
diff --git a/datafusion/ffi/src/table_provider_factory.rs 
b/datafusion/ffi/src/table_provider_factory.rs
index 3ce8841614..466b56806d 100644
--- a/datafusion/ffi/src/table_provider_factory.rs
+++ b/datafusion/ffi/src/table_provider_factory.rs
@@ -152,7 +152,7 @@ impl FFI_TableProviderFactory {
         let plan = LogicalPlanNode::decode(cmd_serialized.as_ref())
             .map_err(|e| DataFusionError::Internal(format!("{e:?}")))?;
         match plan.try_into_logical_plan(&task_ctx, logical_codec.as_ref())? {
-            LogicalPlan::Ddl(DdlStatement::CreateExternalTable(cmd)) => 
Ok(cmd),
+            LogicalPlan::Ddl(DdlStatement::CreateExternalTable(cmd)) => 
Ok(*cmd),
             _ => Err(DataFusionError::Internal(
                 "Invalid logical plan in FFI_TableProviderFactory.".to_owned(),
             )),
@@ -272,7 +272,7 @@ impl ForeignTableProviderFactory {
         let logical_codec: Arc<dyn LogicalExtensionCodec> =
             (&self.0.logical_codec).into();
 
-        let plan = LogicalPlan::Ddl(DdlStatement::CreateExternalTable(cmd));
+        let plan = 
LogicalPlan::Ddl(DdlStatement::CreateExternalTable(Box::new(cmd)));
         let plan: LogicalPlanNode =
             AsLogicalPlan::try_from_logical_plan(&plan, 
logical_codec.as_ref())?;
 
diff --git a/datafusion/proto/src/logical_plan/mod.rs 
b/datafusion/proto/src/logical_plan/mod.rs
index 49593a6c6a..b691441e95 100644
--- a/datafusion/proto/src/logical_plan/mod.rs
+++ b/datafusion/proto/src/logical_plan/mod.rs
@@ -790,26 +790,30 @@ impl AsLogicalPlan for LogicalPlanNode {
                 }
 
                 Ok(LogicalPlan::Ddl(DdlStatement::CreateExternalTable(
-                    CreateExternalTable::builder(
-                        from_table_reference(
-                            create_extern_table.name.as_ref(),
-                            "CreateExternalTable",
-                        )?,
-                        create_extern_table.location.clone(),
-                        create_extern_table.file_type.clone(),
-                        pb_schema.try_into()?,
-                    )
-                    
.with_partition_cols(create_extern_table.table_partition_cols.clone())
-                    .with_order_exprs(order_exprs)
-                    .with_if_not_exists(create_extern_table.if_not_exists)
-                    .with_or_replace(create_extern_table.or_replace)
-                    .with_temporary(create_extern_table.temporary)
-                    .with_definition(definition)
-                    .with_unbounded(create_extern_table.unbounded)
-                    .with_options(create_extern_table.options.clone())
-                    .with_constraints(constraints.into())
-                    .with_column_defaults(column_defaults)
-                    .build(),
+                    Box::new(
+                        CreateExternalTable::builder(
+                            from_table_reference(
+                                create_extern_table.name.as_ref(),
+                                "CreateExternalTable",
+                            )?,
+                            create_extern_table.location.clone(),
+                            create_extern_table.file_type.clone(),
+                            pb_schema.try_into()?,
+                        )
+                        .with_partition_cols(
+                            create_extern_table.table_partition_cols.clone(),
+                        )
+                        .with_order_exprs(order_exprs)
+                        .with_if_not_exists(create_extern_table.if_not_exists)
+                        .with_or_replace(create_extern_table.or_replace)
+                        .with_temporary(create_extern_table.temporary)
+                        .with_definition(definition)
+                        .with_unbounded(create_extern_table.unbounded)
+                        .with_options(create_extern_table.options.clone())
+                        .with_constraints(constraints.into())
+                        .with_column_defaults(column_defaults)
+                        .build(),
+                    ),
                 )))
             }
             LogicalPlanType::CreateView(create_view) => {
@@ -1774,8 +1778,8 @@ impl AsLogicalPlan for LogicalPlanNode {
                     },
                 )),
             }),
-            LogicalPlan::Ddl(DdlStatement::CreateExternalTable(
-                CreateExternalTable {
+            LogicalPlan::Ddl(DdlStatement::CreateExternalTable(ce)) => {
+                let CreateExternalTable {
                     name,
                     location,
                     file_type,
@@ -1790,8 +1794,7 @@ impl AsLogicalPlan for LogicalPlanNode {
                     constraints,
                     column_defaults,
                     temporary,
-                },
-            )) => {
+                } = ce.as_ref();
                 let mut converted_order_exprs: Vec<SortExprNodeCollection> = 
vec![];
                 for order in order_exprs {
                     let temp = SortExprNodeCollection {
diff --git a/datafusion/sql/src/statement.rs b/datafusion/sql/src/statement.rs
index 7da1c061cd..401313f9d3 100644
--- a/datafusion/sql/src/statement.rs
+++ b/datafusion/sql/src/statement.rs
@@ -1468,7 +1468,7 @@ impl<S: ContextProvider> SqlToRel<'_, S> {
                     function_body,
                 };
 
-                let statement = DdlStatement::CreateFunction(CreateFunction {
+                let statement = 
DdlStatement::CreateFunction(Box::new(CreateFunction {
                     or_replace,
                     temporary,
                     name,
@@ -1476,7 +1476,7 @@ impl<S: ContextProvider> SqlToRel<'_, S> {
                     args,
                     params,
                     schema: DFSchemaRef::new(DFSchema::empty()),
-                });
+                }));
 
                 Ok(LogicalPlan::Ddl(statement))
             }
@@ -1855,18 +1855,20 @@ impl<S: ContextProvider> SqlToRel<'_, S> {
         let constraints =
             self.new_constraint_from_table_constraints(&all_constraints, 
&df_schema)?;
         Ok(LogicalPlan::Ddl(DdlStatement::CreateExternalTable(
-            PlanCreateExternalTable::builder(name, location, file_type, 
df_schema)
-                .with_partition_cols(table_partition_cols)
-                .with_if_not_exists(if_not_exists)
-                .with_or_replace(or_replace)
-                .with_temporary(temporary)
-                .with_definition(definition)
-                .with_order_exprs(ordered_exprs)
-                .with_unbounded(unbounded)
-                .with_options(options_map)
-                .with_constraints(constraints)
-                .with_column_defaults(column_defaults)
-                .build(),
+            Box::new(
+                PlanCreateExternalTable::builder(name, location, file_type, 
df_schema)
+                    .with_partition_cols(table_partition_cols)
+                    .with_if_not_exists(if_not_exists)
+                    .with_or_replace(or_replace)
+                    .with_temporary(temporary)
+                    .with_definition(definition)
+                    .with_order_exprs(ordered_exprs)
+                    .with_unbounded(unbounded)
+                    .with_options(options_map)
+                    .with_constraints(constraints)
+                    .with_column_defaults(column_defaults)
+                    .build(),
+            ),
         )))
     }
 
diff --git a/docs/source/library-user-guide/upgrading/55.0.0.md 
b/docs/source/library-user-guide/upgrading/55.0.0.md
index d3988c33a4..ad50a37cb9 100644
--- a/docs/source/library-user-guide/upgrading/55.0.0.md
+++ b/docs/source/library-user-guide/upgrading/55.0.0.md
@@ -97,6 +97,112 @@ as a supertrait:
 + pub trait QueryPlanner: Any + Debug
 ```
 
+### `DdlStatement::CreateExternalTable` and `CreateFunction` are now boxed
+
+The two largest variants of `datafusion_expr::DdlStatement` are now
+`Box`ed:
+
+```rust,ignore
+// Before
+pub enum DdlStatement {
+    CreateExternalTable(CreateExternalTable),
+    // ...
+    CreateFunction(CreateFunction),
+    // ...
+}
+
+// After
+pub enum DdlStatement {
+    CreateExternalTable(Box<CreateExternalTable>),
+    // ...
+    CreateFunction(Box<CreateFunction>),
+    // ...
+}
+```
+
+`CreateExternalTable` is 312 bytes and `CreateFunction` is 288 bytes, so
+without boxing they forced the entire `LogicalPlan` enum to 320 bytes
+even on SELECT-only query paths that never instantiate them. Boxing
+shrinks `LogicalPlan` from 320 → 176 bytes (−45%), making every
+`mem::take` / `mem::swap` / `Arc<LogicalPlan>` store on the planning
+hot path move a smaller payload.
+
+**Who is affected:**
+
+- Users who construct `DdlStatement::CreateExternalTable(...)` or
+  `DdlStatement::CreateFunction(...)` from an owned struct.
+- Users who pattern-match these variants and destructure the inner
+  struct in the same pattern (e.g.
+  `DdlStatement::CreateExternalTable(CreateExternalTable { name, .. })`).
+- Code that consumes the inner struct out of these variants (e.g. to
+  pass `CreateExternalTable` by value to another function).
+
+**Migration guide:**
+
+When constructing the variants, wrap the inner struct in `Box::new`:
+
+```rust,ignore
+// Before
+let stmt = DdlStatement::CreateFunction(CreateFunction { name, args, .. });
+
+// After
+let stmt = DdlStatement::CreateFunction(Box::new(CreateFunction {
+    name,
+    args,
+    ..
+}));
+```
+
+When pattern-matching, bind the boxed value and either access fields
+through it (Rust auto-derefs the `Box`) or destructure via `.as_ref()`:
+
+```rust,ignore
+// Before
+match ddl {
+    DdlStatement::CreateExternalTable(CreateExternalTable {
+        name, location, ..
+    }) => { /* use name, location */ }
+}
+
+// After — access fields through the box
+match ddl {
+    DdlStatement::CreateExternalTable(ce) => {
+        let name = &ce.name;
+        let location = &ce.location;
+        /* ... */
+    }
+}
+
+// After — destructure the dereferenced struct
+match ddl {
+    DdlStatement::CreateExternalTable(ce) => {
+        let CreateExternalTable { name, location, .. } = ce.as_ref();
+        /* ... */
+    }
+}
+```
+
+When you need an owned `CreateExternalTable` / `CreateFunction` out of
+the variant, dereference the box with `*`:
+
+```rust,ignore
+// Before
+match plan {
+    LogicalPlan::Ddl(DdlStatement::CreateExternalTable(cmd)) => Ok(cmd),
+    _ => { /* ... */ }
+}
+
+// After
+match plan {
+    LogicalPlan::Ddl(DdlStatement::CreateExternalTable(cmd)) => Ok(*cmd),
+    _ => { /* ... */ }
+}
+```
+
+See [PR #22733](https://github.com/apache/datafusion/pull/22733) for
+details, including the per-variant size breakdown and benchmark
+results.
+
 ### Spark map functions now reject duplicate keys by default
 
 The Spark-compatibility map-construction functions (`map_from_arrays`,


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to