andygrove opened a new issue, #4384:
URL: https://github.com/apache/datafusion-comet/issues/4384
## What is the problem the feature request solves?
PR #4234 introduces `CometColumnarPythonInput.copyVector`, which walks
Comet's `FieldVector` tree in lockstep with a destination IPC root and
`setBytes`-copies each Arrow buffer, then sets value counts bottom-up so offset
bytes aren't rewritten. This is exactly the kind of recursive bytewise walk
where hand-written cases catch the obvious bugs and miss the long tail.
`CometCodegenFuzzSuite` (#4267) already does the equivalent job for the
codegen path: random schema, random data, run an identity UDF over every
column, assert row-equivalence. The same pattern translates directly to the
pyarrow UDF path.
## Describe the potential solution
A pytest module under `spark/src/test/resources/pyspark/` (alongside
`test_pyarrow_udf.py`) that:
1. **Generates random schemas.** A type generator that draws from primitives
(Long, Int, Short, Byte, Boolean, Float, Double, Date, Timestamp, TimestampNTZ,
String, Binary, Decimal with varied precision/scale) and recurses into
ArrayType / StructType / MapType with bounded depth. Seeded for reproducibility.
2. **Generates random data.** Per column, draw a null fraction from `{0,
0.01, 0.5, 0.99, 1.0}` and fill with random values matching the type. The length
distribution for variable-width and list types should cover both the small and
large extremes.
3. **Writes parquet, reads it back through `mapInArrow(passthrough)`**,
asserts row-equivalence between accelerated and fallback modes.
4. **Forces multiple Arrow IPC batches per partition** by setting
`spark.sql.execution.arrow.maxRecordsPerBatch` low, so the persistent
destination IPC root is exercised across batches in every run.
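As a rough illustration of step 1, the type generator could look something like the sketch below. It emits Spark DDL type strings rather than `pyspark.sql.types` objects so that it runs with the standard library alone. The names `random_type` and `random_schema`, the 50% recursion probability, and the restriction of map keys to three primitives are all assumptions made for this sketch, not part of the proposal.

```python
import random

# Spark SQL names for the primitives listed in step 1 (Decimal handled separately).
PRIMITIVES = [
    "bigint", "int", "smallint", "tinyint", "boolean", "float", "double",
    "date", "timestamp", "timestamp_ntz", "string", "binary",
]


def random_decimal(rng):
    # Spark decimals allow precision 1..38 with 0 <= scale <= precision.
    precision = rng.randint(1, 38)
    scale = rng.randint(0, precision)
    return f"decimal({precision},{scale})"


def random_primitive(rng):
    # Give decimal the same weight as any single other primitive.
    if rng.randrange(len(PRIMITIVES) + 1) == 0:
        return random_decimal(rng)
    return rng.choice(PRIMITIVES)


def random_type(rng, depth):
    """Return a DDL type string with at most `depth` levels of nesting."""
    if depth <= 0 or rng.random() < 0.5:
        return random_primitive(rng)
    kind = rng.choice(["array", "struct", "map"])
    if kind == "array":
        return f"array<{random_type(rng, depth - 1)}>"
    if kind == "map":
        # Assumption: keys are limited to a few primitives to keep the sketch simple.
        key = rng.choice(["int", "bigint", "string"])
        return f"map<{key},{random_type(rng, depth - 1)}>"
    fields = [f"f{i}:{random_type(rng, depth - 1)}" for i in range(rng.randint(1, 3))]
    return f"struct<{','.join(fields)}>"


def random_schema(seed, num_columns=5, max_depth=3):
    """Build a DDL schema string; the same seed always yields the same schema."""
    rng = random.Random(seed)
    return ", ".join(f"c{i} {random_type(rng, max_depth)}" for i in range(num_columns))
```

Logging the seed on failure would let a failing schema be regenerated exactly, which is the point of seeding.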
This closes three of the test gaps from @mbutrovich's review on #4234 (items
1, 3, 4 from the test section) in one move: the harness exercises recursive
vector-tree walks, validity-bit handling across null densities, and multiple
batches per partition without needing per-shape hand-written cases.
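To make the null-density part concrete, here is a minimal sketch of per-column data generation along the lines of step 2. The helper names (`random_column`, `random_string`) and the specific length buckets are hypothetical, chosen only to show one null fraction being drawn per column and lengths spanning empty, tiny, and large values.

```python
import random

# Null fractions taken from step 2 of the proposal.
NULL_FRACTIONS = [0.0, 0.01, 0.5, 0.99, 1.0]
# Hypothetical length buckets: empty, tiny, and one large outlier.
LENGTH_BUCKETS = [0, 1, 3, 4096]


def random_string(rng):
    """Generate a string whose length comes from one of the buckets."""
    length = rng.choice(LENGTH_BUCKETS)
    return "".join(rng.choice("abcxyz") for _ in range(length))


def random_column(rng, num_rows, value_fn):
    """Draw one null fraction for the column, then fill it row by row."""
    null_fraction = rng.choice(NULL_FRACTIONS)
    values = [
        None if rng.random() < null_fraction else value_fn(rng)
        for _ in range(num_rows)
    ]
    return null_fraction, values
```

Because `rng.random()` returns values in `[0, 1)`, a fraction of `0.0` yields no nulls and `1.0` yields an all-null column, so both validity-buffer extremes are reachable.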
The targeted hand-written cases already added in PR #4234 (decimal precision
sweep, null density sweep, multi-batch, wide schema, mid-stream empty batch,
transforming array UDF) stay: they're cheap to keep and they pin the
boundaries that fuzzing won't hit reliably (the precision sweep, for example,
deterministically hits the 18/19 precision boundary that a fuzz run might
happen to skip).
## Additional context
- Reference test fixture:
`spark/src/test/resources/pyspark/test_pyarrow_udf.py` (PR #4234) —
accelerated/fallback parametrisation pattern, jar resolution, session setup all
reusable.
- Related: PR #4234 (introduces the operator), #4383 (drops the per-batch
buffer copy on the same code path — fuzz harness gains value once that's
landed, since the new direct-from-Comet path has its own walk to validate),
`CometCodegenFuzzSuite` (#4267, harness shape to borrow from).
- Originally surfaced as the first item in @mbutrovich's "Tests" section on
#4234.
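For the row-equivalence assertion between accelerated and fallback modes, one possible shape is sketched below. It assumes both runs have been collected into plain Python rows (tuples or lists, with dicts for maps) and that row order is not significant. The helper names `normalize` and `assert_rows_equivalent` are hypothetical and are not taken from `test_pyarrow_udf.py`.

```python
import math


def normalize(value):
    """Map a value to a comparable form in which NaN compares equal to NaN."""
    if isinstance(value, float) and math.isnan(value):
        return "NaN"
    if isinstance(value, dict):
        # Maps become a sorted list of (key, value) pairs so entry order is ignored.
        pairs = ((normalize(k), normalize(v)) for k, v in value.items())
        return sorted(pairs, key=repr)
    if isinstance(value, (list, tuple)):
        return [normalize(v) for v in value]
    return value


def assert_rows_equivalent(accelerated, fallback):
    """Compare two row collections, ignoring row order."""
    left = sorted((normalize(r) for r in accelerated), key=repr)
    right = sorted((normalize(r) for r in fallback), key=repr)
    assert left == right, f"row mismatch: {left!r} != {right!r}"
```

Normalizing NaN matters because random Float and Double columns can contain it, and `float("nan") != float("nan")` would otherwise fail a passthrough that copied the bytes correctly.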
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]