yaooqinn opened a new pull request, #12183:
URL: https://github.com/apache/gluten/pull/12183

   ## Motivation
   
   The V2 `serializeWithStats` fast path (opt-in via 
`spark.gluten.sql.columnar.tableCache.partitionStats.enabled`) used to catch 
only `UnsatisfiedLinkError`. Any `IllegalArgumentException` from 
`parseFramedBytes` (corrupt magic, truncated frame, statsLen/bytesLen mismatch) 
or `KryoException` from `deserializeStats` would bubble out and **fail the 
entire Spark task**. Since V2 is an opt-in optimization, a malformed frame for 
one cached batch should not be user-visible — the existing legacy `serialize()` 
fallback should absorb it just like an `UnsatisfiedLinkError` does.
   
   ## Change
   
   Layered catch in the per-batch fast path:
   
   ```scala
   case e: UnsatisfiedLinkError =>
     // capability gone for the JVM lifetime; trip the latch
     markStatsExtUnavailable(e); fallbackToLegacy()
   case NonFatal(e) =>
     // per-batch corruption; do NOT trip the latch
     warnCorruptStatsFrame(e); fallbackToLegacy()
   ```
   
   The capability latch (`statsExtAvailableFlag`) is deliberately one-way for 
the JVM lifetime — a native-symbol mismatch isn't recoverable. Corrupt-frame 
events, in contrast, are per-batch and shouldn't poison the latch: the next 
batch retries the fast path.
   
   To keep log output bounded when a native regression produces malformed 
frames batch after batch, `warnCorruptStatsFrame` caps warnings at 100 per JVM 
with a final summary line.
   
   ## Refactor
   
   To make the catch site unit-testable (previously nested inside an anonymous 
`Iterator` lambda inside `mapPartitions`), the fast path was extracted into a 
companion-object helper `serializeOneBatchWithStats(jni, handle, numRows, 
structSchema, fallbackToLegacy)`. The first commit is the pure refactor; the 
second commit adds the new arm + tests.
   
   `resetStatsExtAvailableForTesting()` is dropped — it had no callers after 
the new helper suite switched to resetting the flag via reflection directly.
   
   ## Test coverage
   
   New `ColumnarCachedBatchSerializerHelperSuite` (4 cases, Mockito-stubbed JNI 
wrapper, no native runtime needed):
   
   - corrupt magic → fallback `CachedBatch` with `stats=null`
   - truncated framed bytes → fallback
   - Kryo-corrupt statsBlob → fallback
   - `UnsatisfiedLinkError` → still trips capability latch (regression)
   
   Local sentinel — all 9 `ColumnarCachedBatch*Suite` green, 46/0 succeeded, 0 
regression vs apache/main.
   
   ## Risks
   
   - Log cap is hard-coded at 100. Happy to move to a `SQLConf` if reviewers 
prefer.
   - `NonFatal` deliberately does not absorb `OutOfMemoryError` / 
`StackOverflowError` / control-flow exceptions — those should still propagate.
   
   Part of the V2 hardening series following #12176.
   
   Generated-by: Claude claude-opus-4.7
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to