[PR] feat: add Spark-compatible arrays_zip function [datafusion]

via GitHub Sat, 23 May 2026 09:01:27 -0700


CuteChuanChuan opened a new pull request, #22473:
URL: https://github.com/apache/datafusion/pull/22473


   Which issue does this PR close?
   
     - Closes #20888.
   
     Rationale for this change
   
     Spark's arrays_zip returns a list of structs whose fields are named with 
0-based ordinals (0, 1, 2, ...), while DataFusion's arrays_zip uses 1-based 
ordinals (1, 2, 3, ...):
   
     spark.sql("select arrays_zip(array(1, 2, 3), array(2, 3, 4))").printSchema
     root
      |-- arrays_zip(...): array (nullable = false)
      |    |-- element: struct (containsNull = false)
      |    |    |-- 0: integer (nullable = true)
      |    |    |-- 1: integer (nullable = true)
   
     To support Spark compatibility without altering DataFusion's native 
semantics, this PR adds a SparkArraysZip wrapper in the datafusion-spark crate.
   
     What changes are included in this PR?
   
     1. Add SparkArraysZip (datafusion/spark/src/function/array/arrays_zip.rs),
        which delegates to ArraysZip and renames the inner struct fields of the
        returned List<Struct<..>> to 0-based ordinals. The rename is applied to
        both the planning-time DataType and the execution-time ArrayRef; the
        underlying buffers, offsets, and null bitmaps are reused, so no data is
        copied.
     2. Register arrays_zip in the spark array module (mod.rs): 
make_udf_function!, expr_fn, and functions().
     3. Add sqllogictest coverage in
        datafusion/sqllogictest/test_files/spark/array/arrays_zip.slt, mirroring
        scenarios from Spark's DataFrameFunctionsSuite#"dataframe arrays_zip function"
        (df1–df5, df7) and SPARK-24633, plus DataFusion-specific
        LargeList / FixedSizeList cases.
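
     The field renaming described in item 1 can be illustrated with a small
     standalone sketch. It is not the code in this PR: ToyType is a
     hypothetical stand-in for Arrow's DataType, and rename_to_zero_based is
     an illustrative name. It only shows how the struct nested in a list gets
     its field names replaced by 0-based ordinals while the field types are
     left untouched.

```rust
/// Toy stand-in for an Arrow data type (hypothetical; NOT the real
/// `arrow::datatypes::DataType`).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Int32,
    Utf8,
    /// Struct with ordered, named fields.
    Struct(Vec<(String, ToyType)>),
    /// List whose elements all share one type.
    List(Box<ToyType>),
}

/// Rename the fields of a struct nested directly inside a list to 0-based
/// ordinals ("0", "1", ...). Any other shape is returned unchanged.
#[allow(dead_code)]
fn rename_to_zero_based(ty: &ToyType) -> ToyType {
    match ty {
        ToyType::List(inner) => match inner.as_ref() {
            ToyType::Struct(fields) => {
                // Keep each field's type; replace only its name with its index.
                let renamed = fields
                    .iter()
                    .enumerate()
                    .map(|(i, (_, t))| (i.to_string(), t.clone()))
                    .collect();
                ToyType::List(Box::new(ToyType::Struct(renamed)))
            }
            _ => ty.clone(),
        },
        _ => ty.clone(),
    }
}
```

     Applied to a list of structs with fields named 1 and 2, the sketch yields
     the same list with fields named 0 and 1. The toy version clones the type
     description; the PR states that the real wrapper reuses the underlying
     data buffers rather than copying them.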
   
     Are these changes tested?
   
     Yes — 15 sqllogictest cases under spark/array/arrays_zip.slt cover:
   
     - Equal-length, unequal-length (NULL padding), and nested arrays
     - Mixed element types (int/string/boolean, byte/double)
     - NULLs inside arrays, NULL list arguments, empty arrays
     - Single-argument and many-argument cases (up to 6 arrays per SPARK-24633)
     - Column-level inputs with NULL rows
     - LargeList and FixedSizeList inputs (DataFusion-specific list flavors)
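
     As a reading aid for the unequal-length case above, here is a minimal
     sketch of zipping with NULL padding. It is a simplified model, not the
     DataFusion implementation: zip_with_null_padding is a hypothetical
     helper, it handles only i32 elements, and it does not model NULL list
     arguments or struct field names.

```rust
/// Zip several integer arrays row-wise. Inputs shorter than the longest
/// array are padded with `None`, mirroring the NULL-padding test scenario.
/// (Hypothetical helper for illustration only.)
#[allow(dead_code)]
fn zip_with_null_padding(arrays: &[Vec<Option<i32>>]) -> Vec<Vec<Option<i32>>> {
    // The output has as many rows as the longest input array.
    let len = arrays.iter().map(|a| a.len()).max().unwrap_or(0);
    (0..len)
        .map(|row| {
            arrays
                .iter()
                // Out-of-range positions and NULL elements both become `None`.
                .map(|a| a.get(row).copied().flatten())
                .collect()
        })
        .collect()
}
```

     Each output row plays the role of one struct in the result list, with
     one slot per input array.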
   
     The existing 1-based arrays_zip sqllogictest in array/arrays_zip.slt 
remains unchanged and continues to pass, confirming no regression in 
DataFusion's native behavior.
   
     Are there any user-facing changes?
   
     Yes — when the datafusion-spark crate is loaded (e.g. via 
SessionStateBuilder::with_spark_features()), arrays_zip produces structs with 
0-based field names instead of 1-based. No breaking changes to DataFusion's 
native arrays_zip.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

