mbutrovich opened a new pull request, #4549:
URL: https://github.com/apache/datafusion-comet/pull/4549
## Which issue does this PR close?
Closes #4539.
## Rationale for this change
Routing a map- or array-typed expression through the JVM codegen dispatcher
could corrupt the output. For example, `map_concat(map(1,'a',2,'b'),
map(3,'c'))` produced `{1->a, 2->b, 0->c}` (last key corrupted to 0). The query
runs natively with no fallback, so this is a wrong-answer bug.
Root cause: fixed-width scalar writes into nested collection children used
Arrow's `set`, which does not grow the buffer. `allocateOutput` sizes a vector
from `numRows`, but a collection child's element count is the data-dependent
sum of per-row collection sizes, not bounded by `numRows`. The total is also
unknown until the per-row `ev.code` has run inside the write loop, so the child
cannot be presized. Once a row held more entries than the presized child, the
write ran off the end of the buffer.
Comet enables Arrow unsafe memory access
(`arrow.enable_unsafe_memory_access=true`, set by `NativeBase`), so that
out-of-bounds `set` is not checked at runtime: it writes past the child and the
value reads back as 0. With bounds checking on (e.g. the kernel unit test), the
same write throws `IndexOutOfBoundsException`. Same bug, two symptoms.
The value side (String) already used `setSafe`, which is why values survived
and keys did not.
## What changes are included in this PR?
- `emitWrite` takes a `nested` flag. Fixed-width Boolean/numeric/temporal
writes use `setSafe` when nested and the `set` fast path otherwise.
- Call sites:
- array elements and map key/value pass `nested = true` (always written at
a cumulative index).
- struct fields propagate the struct's own `nested` (fields are co-indexed
with the struct, so a field is nested exactly when the struct is).
- the root output stays `set` (presized to `numRows`, one scalar per row).
- Variable-width (String/Binary) and Decimal writes already used `setSafe`
and are unchanged.
## How are these changes tested?
- New executing kernel tests in `CometCodegenSuite` that compile a kernel,
run one batch, and read the output back via `CometVector`, covering each
vulnerable path:
- constant-folded `map_concat` (map),
- constant-folded `array(...)` with more elements than the presized child
(array),
- `Array<Struct<Int, String>>` (struct nested in a collection).
- Source assertions in `CometCodegenSourceSuite`: nested fixed-width
children emit `setSafe`, and a top-level scalar output keeps the `set` fast
path.
- All three executing tests fail (overflow) if the fix is reverted.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]