andygrove opened a new issue, #3150:
URL: https://github.com/apache/datafusion-comet/issues/3150
## What is the problem the feature request solves?
> **Note:** This issue was generated with AI assistance. The specification details have been extracted from Spark documentation and may need verification.

Comet does not currently support the Spark `create_array` function, causing queries using this function to fall back to Spark's JVM execution instead of running natively on DataFusion.

The `CreateArray` expression creates an array from a sequence of individual elements. It takes multiple input expressions as children and combines them into a single array value, handling type coercion to ensure all elements have a compatible common type.

Supporting this expression would allow more Spark workloads to benefit from Comet's native acceleration.
## Describe the potential solution
### Spark Specification
**Syntax:**
```sql
array(expr1, expr2, ..., exprN)
```
```scala
// DataFrame API
import org.apache.spark.sql.functions.{array, col, lit}
array(col("col1"), col("col2"), lit(3))
```
**Arguments:**

| Argument | Type | Description |
|----------|------|-------------|
| children | Seq[Expression] | Variable number of expressions that will become array elements |
| useStringTypeWhenEmpty | Boolean | Flag determining default element type for empty arrays (StringType if true, NullType if false) |

**Return Type:** `ArrayType` with element type determined by finding the common type among all input expressions. The array is marked as containing nulls if any input expression is nullable.

**Supported Data Types:**
- All primitive types (numeric, string, boolean, binary, date, timestamp)
- Complex types (arrays, maps, structs)
- All input expressions must have compatible types that can be coerced to a common type

**Edge Cases:**
- **Empty arrays**: Uses `defaultElementType` (StringType or NullType) based on the `useStringTypeWhenEmpty` flag
- **Null elements**: Individual null values are preserved as array elements;
the array itself is never null
- **Type coercion**: Automatically promotes compatible types (e.g., int and
long become long array)
- **Mixed nullability**: Array is marked as containing nulls if any input
expression is nullable, even if the actual values are non-null (see the schema check after this list)
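
The coercion and nullability behavior above is easy to observe from a result schema. Below is a minimal sketch, assuming an active `SparkSession` named `spark` and a small DataFrame with a non-nullable int column and a nullable long column; the commented schema shows the expected outcome (elements widened to long, `containsNull = true`) rather than output captured from a run.

```scala
import org.apache.spark.sql.functions.{array, col}

// Assumes an active SparkSession named `spark`.
import spark.implicits._

// `i` is a non-nullable int column, `l` is a nullable long column.
val df = Seq((1, Option(2L)), (3, None: Option[Long])).toDF("i", "l")

// int and long children are coerced to a common long element type, and the
// element is marked as containing nulls because `l` is nullable, even though
// the array value itself is never null.
df.select(array(col("i"), col("l")).alias("arr")).printSchema()
// root
//  |-- arr: array (nullable = false)
//  |    |-- element: long (containsNull = true)
```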
**Examples:**
```sql
-- Basic array creation
SELECT array(1, 2, 3);
-- Result: [1, 2, 3]
-- Mixed compatible types (widened to a common fractional type)
SELECT array(1, 2.5, 3);
-- Result: [1.0, 2.5, 3.0]
-- String array
SELECT array('a', 'b', 'c');
-- Result: ["a", "b", "c"]
-- Array with nulls
SELECT array(1, null, 3);
-- Result: [1, null, 3]
-- Empty array
SELECT array();
-- Result: []
```
```scala
// DataFrame API examples
import org.apache.spark.sql.functions._
// Create array from column values
df.select(array(col("col1"), col("col2"), col("col3")))
// Mix columns and literals
df.select(array(col("name"), lit("constant"), col("other")))
// Nested in other operations
df.select(array(col("a"), col("b")).alias("arr"))
  .filter(size(col("arr")) > 0)
```
### Implementation Approach
See the [Comet guide on adding new
expressions](https://datafusion.apache.org/comet/contributor-guide/adding_a_new_expression.html)
for detailed instructions.
1. **Scala Serde**: Add an expression handler in
`spark/src/main/scala/org/apache/comet/serde/` (a rough sketch of this handler follows the list)
2. **Register**: Add to appropriate map in `QueryPlanSerde.scala`
3. **Protobuf**: Add message type in `native/proto/src/proto/expr.proto` if
needed
4. **Rust**: Implement in `native/spark-expr/src/` (check if DataFusion has
built-in support first)
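
Since DataFusion already ships a `make_array` scalar function for building arrays, step 4 may need little or no new Rust code, and most of the work is likely in step 1. The sketch below only illustrates the general shape of such a handler: the helper names `exprToProtoInternal` and `scalarFunctionExprToProto`, their signatures, and the generated proto classes are assumptions and should be verified against the current `QueryPlanSerde.scala` and `expr.proto` rather than taken as-is.

```scala
// Hypothetical sketch only: helper names, signatures, and proto classes below
// are assumptions and must be checked against the current Comet serde code.
import org.apache.spark.sql.catalyst.expressions.{Attribute, CreateArray}

import org.apache.comet.serde.ExprOuterClass.Expr
import org.apache.comet.serde.QueryPlanSerde.{exprToProtoInternal, scalarFunctionExprToProto}

object CometCreateArray {

  // Serialize each child expression, then emit a call to DataFusion's
  // `make_array` scalar function with those children as arguments.
  def convert(expr: CreateArray, inputs: Seq[Attribute]): Option[Expr] = {
    val childProtos = expr.children.map(exprToProtoInternal(_, inputs))
    if (childProtos.forall(_.isDefined)) {
      scalarFunctionExprToProto("make_array", childProtos: _*)
    } else {
      // If any child cannot be converted, return None so the whole
      // expression falls back to Spark.
      None
    }
  }
}
```

The handler would then be registered in the expression mapping in `QueryPlanSerde.scala` (step 2) so the planner can route `CreateArray` through it.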
## Additional context
**Difficulty:** Medium

**Spark Expression Class:** `org.apache.spark.sql.catalyst.expressions.CreateArray`

**Related:**
- `CreateMap` - Creates map structures from key-value pairs
- `CreateNamedStruct` - Creates struct objects from field name-value pairs
- `GetArrayItem` - Extracts elements from arrays by index
- `ArrayContains` - Checks if array contains specific values
---
*This issue was auto-generated from Spark reference documentation.*