mbutrovich opened a new issue, #4440:
URL: https://github.com/apache/datafusion-comet/issues/4440

   Reading through PR #3221, PR #3266, PR #3649, and epic #3268, I want to 
surface some questions about whether the rationale for 
`CometNativeColumnarToRowExec` still holds. I might be missing context, so this 
is a question rather than a proposal.
   
   ### Background
   
   PR #3221 (landing) was upfront that there was no significant throughput win:
   
   > Although this native implementation does not provide significant speed 
improvements, the main benefit is reduced GC pressure by avoiding Java 
allocations.
   
   The benchmark table in that PR showed native winning 1 of 8 scenarios, 
losing 3, tying 4.
   
   PR #3649 (enable by default) retracted its ~10% TPC-H claim as misattributed 
to native scan, leaving "no regression + GC pressure" as the standing rationale.
   
   I could not find a GC measurement in any of the PR threads, in #3268, in the 
operator's SQLMetrics, or in `spark/benchmarks/`. The GC pressure benefit 
appears to be theoretical, not measured.
   
   ### A quick re-run on main (Spark 4.1, 1M rows, fixedWidthOnly)
   
   ```
   Spark (ColumnarToRowExec)                     68 ms
   Comet JVM (CometColumnarToRowExec)            65 ms
   Comet Native (CometNativeColumnarToRowExec)   64 ms
   ```
   
   All within ~5%, roughly consistent with the numbers in PR #3221.
   
   ### A JFR pass at allocations
   
   Running the same benchmark with `-XX:StartFlightRecording=settings=profile`, 
I expected to see per-row allocations from `unsafeRow.copy()` in 
`NativeColumnarToRowConverter.scala:142`. Out of 2,809 `byte[]` allocation 
samples, zero appeared in any C2R operator's stack. All came from the parquet 
scan and broadcast/serializer paths shared across implementations.
   
   The most likely explanation is that under the `.noop()` sink the benchmark 
uses, HotSpot escape analysis elides Native's per-row `.copy()` (and any 
equivalent allocations on the JVM path). If that is right, this benchmark setup 
cannot validate the GC pressure claim either way, since the rows never escape 
the iteration.
   
   ### Questions at address
   
   1. Is there a measurement of native C2R's GC benefit that I am overlooking? 
A pointer to a concrete workload with JFR or `jdk.GarbageCollection` evidence 
would help me understand the current value.
   
   2. Would adding a retain-mode benchmark variant to 
`CometColumnarToRowBenchmark` (e.g., `collect`, hash-agg, broadcast build, 
anything that defeats EA) be worth it, so the GC claim becomes testable?
   
   3. Structurally, `NativeRowIterator.next` calls `unsafeRow.copy()` per row, 
which allocates `byte[sizeInBytes]` + `UnsafeRow` per row. The JVM path 
allocates the same shape of garbage when the consumer retains. Is there a 
retention pattern where the two allocation profiles differ in a way that 
matters?
   
   4. If throughput parity and unmeasured GC turn out to be confirmed, would it 
be worth discussing whether the ~3,000 LOC of Rust + JNI + glue is paying for 
itself, and whether the operator should stay enabled by default? Not proposing 
removal, just asking what the cost/benefit looks like.
   
   I was a reviewer on PR #3221, so this is partly a question I am putting back 
to myself: I do not think I challenged the GC framing at the time, and I want 
to make sure it actually holds up. Apologies if any of this has already been 
discussed somewhere I missed.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to