rynewang commented on PR #48717:
URL: https://github.com/apache/arrow/pull/48717#issuecomment-3741469072
@pitrou Thanks for the review! After investigation, I found that the
sentinel approach is needed because:
**The bug is manifestable through the public `WriteBatch` API.** When users
call:
```cpp
ByteArray empty{0, nullptr};
writer->WriteBatch(n, nullptr, nullptr, &empty);
```
The empty string statistics are lost because `Min()`/`Max()` skip nullptr
values.
**Why other paths work:**
- Arrow arrays → `VisitArraySpanInline` uses sentinel
(`visit_data_inline.h:96`)
- Reading stats from Parquet → `PlainDecode` uses `c_str()` (always non-null)
- `ByteArray("")` literals → non-null pointer to string literal
**Why `ByteArray("")` won't work for defaults:**
It has a non-null pointer, so we couldn't distinguish "no value yet" from
"actual empty string value".
I've added tests using the public `WriteBatch` API that fail without the fix
and pass with it:
- `TestByteArrayValuesWriter.EmptyStringWithNullptrStats`
- `TestByteArrayValuesWriter.AllEmptyStringsWithNullptrStats`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]