mbutrovich opened a new issue, #4279:
URL: https://github.com/apache/datafusion-comet/issues/4279
## What is the problem the feature request solves?
`CometPlainVector.getBoolean(int)` already caches the last-read data-bitmap
byte to avoid re-reading it eight times per byte of bit-packed booleans:
```java
private byte booleanByteCache;
private int booleanByteCacheIndex = -1;
public boolean getBoolean(int rowId) {
int byteIndex = rowId >> 3;
if (byteIndex != booleanByteCacheIndex) {
booleanByteCache = getByte(byteIndex);
booleanByteCacheIndex = byteIndex;
}
return ((booleanByteCache >> (rowId & 7)) & 1) == 1;
}
```
The validity bitmap (checked by `isNullAt`) is also bit-packed one bit per
row, but it has no equivalent cache. Sequential row reads on a nullable column
hit the same validity-buffer byte eight times in a row. Every caller that reads
a column row-by-row (including the native scan output path, shuffle read path,
and the JVM UDF dispatch kernel in #4267) pays the duplicate reads.
## Describe the potential solution
Add a sibling byte cache for the validity bitmap, using the same pattern
already present for `booleanByteCache`:
```java
private byte validityByteCache;
private int validityByteCacheIndex = -1;
@Override
public boolean isNullAt(int rowId) {
if (this.valueBufferAddress == -1) return true;
int byteIndex = rowId >> 3;
if (byteIndex != validityByteCacheIndex) {
validityByteCache = /* read validity-buffer byte at byteIndex */;
validityByteCacheIndex = byteIndex;
}
return ((validityByteCache >> (rowId & 7)) & 1) == 0;
}
@Override
public void setNumNulls(int numNulls) {
super.setNumNulls(numNulls);
this.booleanByteCacheIndex = -1;
this.validityByteCacheIndex = -1; // invalidate on state transitions
}
```
Collapses the 8x duplicate validity-byte reads to one read per 8 rows on
sequential access.
## Additional context
- Mirrors the established pattern in the same class (`booleanByteCache`).
- The existing `setNumNulls` override resets `booleanByteCacheIndex`; the
validity cache would extend that.
- No API change. No behavior change. Strictly reduces per-row work for any
nullable column read sequentially.
- Random-access patterns see no win but no regression either (cache miss on
every call, same cost as today after the miss).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]