yaooqinn opened a new pull request, #12147:
URL: https://github.com/apache/gluten/pull/12147
## What changes were proposed in this pull request?
`ColumnarCachedBatchSerializer.read` used `input.available() > 0` as
an EOF probe to guard the trailing `hasStats` / `hasSchema` boolean
markers, intending to tolerate a hypothetical V1 wire that lacked
them. This PR removes that probe and reads the markers directly.
## Why are the changes needed?
The probe is broken on two counts:
1. **`Kryo Input.available()` is not an EOF probe.** It returns
`(limit - position) + underlyingStream.available()`. The JDK
`InputStream.available()` contract permits any implementation to
return 0 even when more data follows — `BufferedInputStream` over
shuffle-spill / network chunk boundaries routinely does so. When
the Kryo buffer is drained AND the underlying stream reports 0 at
the trailing-boolean byte position, the reader falsely concludes
EOF, skips `hasStats`, and the next `readClassAndObject`
interprets the stats payload (which contains the schema JSON) as
a class name.
Stack we observed (truncated):
```
ClassNotFoundException: {"type":"struct","fields":[...
at DefaultClassResolver.readName(DefaultClassResolver.java:154)
at DefaultClassResolver.readClass(DefaultClassResolver.java:133)
at Kryo.readClass(Kryo.java:693)
at Kryo.readClassAndObject(Kryo.java:804)
at ColumnarCachedBatchSerializer$$anon$2.hasNext(...)
```
2. **Spark cached blocks live and die within a single application
instance.** `BlockManager.stop()` clears `localDirs` on app exit,
so producer and consumer always run the same jar — the trailing
boolean markers are necessarily present. The V1 wire era (commit
`97632d8435`) was never released to any tag (`git tag --contains
97632d8435` returns empty), so the guard protects nothing.
The length-bound `require(... <= maxLen ...)` guard from
`491070bf34` is preserved — it remains useful against
`NegativeArraySizeException` / oversized allocation and is
orthogonal.
A truncated stream now surfaces `KryoException` rather than silently
returning null and letting the next `readClass` misread payload
bytes. A new test
`ColumnarCachedBatchKryoBoundaryProbeBugSuite#"truncated wire ..."`
locks this contract.
## Does this PR introduce any user-facing change?
No (bug fix in a build that has not shipped to any tagged release).
## How was this patch tested?
New `ColumnarCachedBatchKryoBoundaryProbeBugSuite` (2 tests):
- `"multi-batch deserialize survives boundary-aligned trailing-boolean
probe"`
— deterministic repro of the production stack using a
1-byte-per-read `InputStream` that returns `available() == 0`.
Fails on the pre-patch code, passes after the fix.
- `"truncated wire (no trailing hasStats/hasSchema) fails fast, not
silently"`
— locks the new fail-fast contract.
Verified locally on `-Pspark-4.1 -Pscala-2.13`; affected suites
(`Framed`, `BuildFilter`, `StatsBlob`, `ShipBlockerMarshal`,
`IntFamilyMarshal`) all green (23/23).
---
**Was this patch authored or co-authored using generative AI tooling?**
No
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]