andygrove commented on PR #4672:
URL: 
https://github.com/apache/datafusion-comet/pull/4672#issuecomment-4800350232

   Thanks for the rework @sandugood, and for pulling the benchmark numbers and 
the `cargo bench` invocation into the description. The primitive, date, and 
timestamp paths are implemented the way #4626 was after. Reading the values as 
a slice and building a `NullBuffer` from the flipped Spark bitmap is the right 
shape, and the speedups are great.
   
   One thing I want to call out as a nice catch: threading the timezone through 
and calling `.with_timezone_opt(timezone)` is actually load-bearing, not just 
defensive. `PrimitiveBuilder::append_array` asserts `data_type` equality 
(`primitive_builder.rs:291`), so without it the timestamp path would panic 
whenever the column carries a timezone. Good that you handled it.
   
   A few items before this is ready:
   
   **1. The boolean nullable path is UB, and this revision moved it the wrong 
way.** This is item 1 from the earlier review. The new nullable branch does:
   
   ```rust
   let slice = unsafe { std::slice::from_raw_parts(self.element_offset as 
*const bool, num_elements) };
   for (idx, &value) in slice.iter().enumerate() { ... }
   ```
   
   Rust requires every `bool` to be exactly `0` or `1`. For null slots the JVM 
byte is not guaranteed to be `0`/`1`, and `for (idx, &value)` materializes the 
`bool` out of each byte even for slots that get skipped, which is UB. The 
previous version of this branch read `*const u8` and did `*ptr != 0`, and the 
non-nullable branch right below still does the safe thing:
   
   ```rust
   let values = unsafe { std::slice::from_raw_parts(self.element_offset as 
*const u8, num_elements) };
   for &value in values { builder.append_value(value != 0); }
   ```
   
   Could you mirror that `*const u8` pattern in the nullable branch 
(`builder.append_value(value != 0)`)? Since Spark stores booleans as one byte 
each rather than bit-packed, a per-element loop here is reasonable, so the `u8` 
cast is really all that is needed. This is very likely the UB you hit in CI 
earlier, since there is a `miri` workflow. Worth confirming against a nullable 
boolean array with a null slot once the cast is in.
   
   **2. Copy-paste leftover in `append_dates_to_builder`.** The `debug_assert!` 
message reads `"append_timestamps: element_offset is null"` and the comment 
above says `contiguous i64 data`, but this is the dates path. Both should say 
`append_dates` and `i32`.
   
   **3. A couple of targeted tests would help.** The bitmap flip and the 
partial-byte handling are easy to get subtly wrong. Would you be open to adding 
unit tests for nullable arrays where `num_elements` is not a multiple of 8 with 
a null in the final byte (say 7, 9, 17), plus all-null and all-valid cases? 
That locks in the trailing-bit behavior, which currently relies on the 
`BooleanBuffer` length trimming the flipped padding bits.
   
   **4. Formatting.** The `process_primitive_lists` macro in `row.rs` looks 
like the indentation drifted. A quick `cargo fmt` should clean it up.
   
   **5. Minor.** The byte reinterpretation of the null word assumes 
little-endian bit ordering. That matches Arrow and every platform we target, so 
it is fine, but a one-line comment next to the `null_words as *const u8` cast 
noting the little-endian assumption would help the next reader.
   
   The boolean path is the only blocker. Everything else is small. Once that is 
a `u8` read I think this is in good shape.
   
   *Disclosure: this review was assisted by an AI tool (Anthropic Claude via 
Claude Code). I read the diff, the linked issue, and the prior review thread 
before forming these comments, and I take responsibility for them.*
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to