mbutrovich opened a new pull request, #4549:
URL: https://github.com/apache/datafusion-comet/pull/4549

   ## Which issue does this PR close?
   
   Closes #4539.
   
   ## Rationale for this change
   
   Routing a map- or array-typed expression through the JVM codegen dispatcher 
could corrupt the output. For example, `map_concat(map(1,'a',2,'b'), 
map(3,'c'))` produced `{1->a, 2->b, 0->c}` (last key corrupted to 0). The query 
runs natively with no fallback, so this is a wrong-answer bug.
   
   Root cause: fixed-width scalar writes into nested collection children used 
Arrow's `set`, which does not grow the buffer. `allocateOutput` sizes a vector 
from `numRows`, but a collection child's element count is the data-dependent 
sum of per-row collection sizes, not bounded by `numRows`. The total is also 
unknown until the per-row `ev.code` has run inside the write loop, so the child 
cannot be presized. Once a row held more entries than the presized child, the 
write ran off the end of the buffer.
   
   Comet enables Arrow unsafe memory access 
(`arrow.enable_unsafe_memory_access=true`, set by `NativeBase`), so that 
out-of-bounds `set` is not checked at runtime: it writes past the child and the 
value reads back as 0. With bounds checking on (e.g. the kernel unit test), the 
same write throws `IndexOutOfBoundsException`. Same bug, two symptoms.
   
   The value side (String) already used `setSafe`, which is why values survived 
and keys did not.
   
   ## What changes are included in this PR?
   
   - `emitWrite` takes a `nested` flag. Fixed-width Boolean/numeric/temporal 
writes use `setSafe` when nested and the `set` fast path otherwise.
   - Call sites:
     - array elements and map key/value pass `nested = true` (always written at 
a cumulative index).
     - struct fields propagate the struct's own `nested` (fields are co-indexed 
with the struct, so a field is nested exactly when the struct is).
     - the root output stays `set` (presized to `numRows`, one scalar per row).
   - Variable-width (String/Binary) and Decimal writes already used `setSafe` 
and are unchanged.
   
   ## How are these changes tested?
   
   - New executing kernel tests in `CometCodegenSuite` that compile a kernel, 
run one batch, and read the output back via `CometVector`, covering each 
vulnerable path:
     - constant-folded `map_concat` (map),
     - constant-folded `array(...)` with more elements than the presized child 
(array),
     - `Array<Struct<Int, String>>` (struct nested in a collection).
   - Source assertions in `CometCodegenSourceSuite`: nested fixed-width 
children emit `setSafe`, and a top-level scalar output keeps the `set` fast 
path.
   - All three executing tests fail (overflow) if the fix is reverted.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to