[I] Add randomised fuzz harness for pyarrow UDF vector-copy path [datafusion-comet]

via GitHub Thu, 21 May 2026 08:40:28 -0700


andygrove opened a new issue, #4384:
URL: https://github.com/apache/datafusion-comet/issues/4384


   ## What is the problem the feature request solves?
   
   PR #4234 introduces `CometColumnarPythonInput.copyVector`, which walks 
Comet's `FieldVector` tree in lockstep with a destination IPC root and 
`setBytes`-copies each Arrow buffer, then sets value counts bottom-up so offset 
bytes aren't rewritten. This is exactly the kind of recursive bytewise walk 
where hand-written cases catch the obvious bugs and miss the long tail.
   
   `CometCodegenFuzzSuite` (#4267) already does the equivalent job for the 
codegen path: random schema, random data, run an identity UDF over every 
column, assert row-equivalence. The same pattern translates directly to the 
pyarrow UDF path.
   
   ## Describe the potential solution
   
   A pytest module under `spark/src/test/resources/pyspark/` (alongside 
`test_pyarrow_udf.py`) that:
   
   1. **Generates random schemas.** A type generator that draws from primitives 
(Long, Int, Short, Byte, Boolean, Float, Double, Date, Timestamp, TimestampNTZ, 
String, Binary, Decimal with varied precision/scale) and recurses into 
ArrayType / StructType / MapType with bounded depth. Seeded for reproducibility.
   2. **Generates random data.** Per column, draw a null fraction from `{0, 
0.01, 0.5, 0.99, 1.0}` and fill with random values matching the type. The 
length distribution for variable-width and list types should cover both the 
small and the large extremes.
   3. **Writes Parquet, reads it back through `mapInArrow(passthrough)`**, 
and asserts row-equivalence between accelerated and fallback modes.
   4. **Forces multiple Arrow IPC batches per partition** by setting 
`spark.sql.execution.arrow.maxRecordsPerBatch` low, so the persistent 
destination IPC root is exercised across batches in every run.
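
A minimal sketch of the seeded type generator from step 1, using only the Python standard library. It emits a Spark DDL schema string rather than `StructType` objects so it runs without a Spark session. The names `random_type` and `random_schema`, the column and field naming (`c0`, `f0`), the restriction of map keys to three simple types, and the default depth and width are all assumptions made for illustration, not part of PR #4234 or `CometCodegenFuzzSuite`.

```python
import random

# Spark SQL DDL names for the primitive types listed in step 1.
PRIMITIVES = [
    "bigint", "int", "smallint", "tinyint", "boolean", "float", "double",
    "date", "timestamp", "timestamp_ntz", "string", "binary",
]
# Assumption: map keys are limited to a few simple types to keep the sketch small.
MAP_KEY_TYPES = ["int", "bigint", "string"]


def random_type(rng: random.Random, depth: int) -> str:
    """Return a random Spark DDL type string with at most `depth` nesting levels."""
    kinds = ["primitive", "decimal"]
    if depth > 0:
        # Nested types are only offered while the depth budget lasts.
        kinds += ["array", "struct", "map"]
    kind = rng.choice(kinds)
    if kind == "primitive":
        return rng.choice(PRIMITIVES)
    if kind == "decimal":
        precision = rng.randint(1, 38)  # 38 is Spark's maximum decimal precision
        scale = rng.randint(0, precision)
        return f"decimal({precision},{scale})"
    if kind == "array":
        return f"array<{random_type(rng, depth - 1)}>"
    if kind == "map":
        key = rng.choice(MAP_KEY_TYPES)
        return f"map<{key},{random_type(rng, depth - 1)}>"
    # Remaining case: a struct with one to three fields.
    fields = [f"f{i}:{random_type(rng, depth - 1)}" for i in range(rng.randint(1, 3))]
    return f"struct<{','.join(fields)}>"


def random_schema(seed: int, num_columns: int = 5, max_depth: int = 2) -> str:
    """Build a DDL schema string; the same seed always yields the same schema."""
    rng = random.Random(seed)
    cols = [f"c{i} {random_type(rng, max_depth)}" for i in range(num_columns)]
    return ", ".join(cols)
```

Logging the seed on failure would let a failing schema be regenerated with `random_schema(seed)`. Drawing all randomness from one `random.Random` instance, rather than the module-level functions, keeps the output independent of any other test that touches the global generator.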
   
   This closes three of the test gaps from @mbutrovich's review on #4234 (items 
1, 3, 4 from the test section) in one move: the harness exercises recursive 
vector-tree walks, validity-bit handling across null densities, and multiple 
batches per partition, without needing hand-written cases for each shape.
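
The null-density and length draws from step 2 could look roughly like the sketch below. It is standard-library Python only. The helper names (`random_length`, `random_string`, `random_column`), the four length buckets, and the `large=1000` ceiling are hypothetical choices, not values taken from the issue or from PR #4234.

```python
import random
import string

# Null fractions listed in step 2 of the proposal.
NULL_FRACTIONS = [0.0, 0.01, 0.5, 0.99, 1.0]


def random_length(rng: random.Random, large: int = 1000) -> int:
    """Pick a length biased toward the extremes: empty, single, small, or large."""
    bucket = rng.choice(["empty", "one", "small", "large"])
    if bucket == "empty":
        return 0
    if bucket == "one":
        return 1
    if bucket == "small":
        return rng.randint(2, 16)
    return rng.randint(large // 2, large)


def random_string(rng: random.Random) -> str:
    """Example leaf generator: an ASCII string with an extreme-biased length."""
    return "".join(rng.choice(string.ascii_letters) for _ in range(random_length(rng)))


def random_column(rng: random.Random, num_rows: int, make_value):
    """Draw one null fraction for the column, then null out each row independently."""
    null_fraction = rng.choice(NULL_FRACTIONS)
    values = [
        # rng.random() is in [0, 1), so 0.0 never nulls a row and 1.0 always does.
        None if rng.random() < null_fraction else make_value(rng)
        for _ in range(num_rows)
    ]
    return null_fraction, values
```

Because each row is nulled independently, the observed null ratio only approximates the drawn fraction, except at `0.0` and `1.0`, which give the all-valid and all-null columns exactly.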
   
   The targeted hand-written cases already added in PR #4234 (decimal precision 
sweep, null density sweep, multi-batch, wide schema, mid-stream empty batch, 
transforming array UDF) stay — they're cheap to keep and they pin the 
boundaries that fuzz won't find reliably (the precision sweep, for example, 
deterministically hits the 18/19 boundary that fuzz might happen to skip).
   
   ## Additional context
   
   - Reference test fixture: 
`spark/src/test/resources/pyspark/test_pyarrow_udf.py` (PR #4234). Its 
accelerated/fallback parametrisation pattern, jar resolution, and session setup 
are all reusable.
   - Related: PR #4234 (introduces the operator), #4383 (drops the per-batch 
buffer copy on the same code path — fuzz harness gains value once that's 
landed, since the new direct-from-Comet path has its own walk to validate), 
`CometCodegenFuzzSuite` (#4267, harness shape to borrow from).
   - Originally surfaced as the first item in @mbutrovich's "Tests" section on 
#4234.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


