alamb commented on PR #9574:
URL: https://github.com/apache/arrow-rs/pull/9574#issuecomment-4172905627
BTW, Codex says it found issues with this PR; I haven't had time to review them carefully, but I would like to before we merge this. I hope to find time for this PR over the next few days.
<details><summary>Details</summary>
<p>
• Findings
- High: LargeUtf8 filtering is broken, and LargeBinary is very likely broken for the same reason. In parquet/src/arrow/arrow_reader/dictionary_pruning.rs:177 the code handles both Utf8 and LargeUtf8 by building a StringArray, and in parquet/src/arrow/arrow_reader/dictionary_pruning.rs:192 the fallback always builds a BinaryArray. But the synthetic schema uses the real Arrow type at parquet/src/arrow/arrow_reader/dictionary_pruning.rs:123, so RecordBatch::try_new fails for LargeUtf8/LargeBinary. I reproduced this in a disposable worktree with an async row-filter test; it panicked with: expected LargeUtf8 but found Utf8 at column index 0.
- High: the fallback “all pages are dictionary encoded” check is not safe and can return wrong query results on files without page encoding stats. The code explicitly falls back to a heuristic in parquet/src/arrow/arrow_reader/dictionary_pruning.rs:234 and treats Encoding::PLAIN as acceptable at parquet/src/arrow/arrow_reader/dictionary_pruning.rs:253. But PLAIN is exactly the ambiguous case for dictionary fallback pages. The metadata docs in parquet/src/file/metadata/mod.rs:1086 say page encoding stats were added specifically to answer this question safely. If a column chunk contains both dictionary pages and plain fallback data pages and lacks stats, this code can incorrectly prune rows.
- Medium: the new logic is not protected by meaningful tests. I reverted the PR’s implementation files in a temporary worktree and reran the existing decoder-focused tests; they still passed. That means the current tests do not fail when dictionary pruning is removed. Coverage confirms this: the llvm-cov report for parquet/src/arrow/arrow_reader/dictionary_pruning.rs (under target/llvm-cov/html/) shows only 34.06% line coverage, with the BYTE_ARRAY path (dictionary_pruning.rs:142), the full page-stats path (dictionary_pruning.rs:217), the heuristic fallback (dictionary_pruning.rs:240), and all INT32 conversion logic (dictionary_pruning.rs:269) uncovered.
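The first (LargeUtf8) finding reduces to a schema/array type mismatch. Here is a minimal, self-contained sketch of the failure mode — the DataType enum and try_new below are hypothetical stand-ins for the arrow-rs API, not the real types; the real fix would build a LargeStringArray/LargeBinaryArray when the schema declares the large variant:

```rust
// Hypothetical stand-ins for Arrow data types; NOT the real arrow-rs API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Utf8,
    LargeUtf8,
    Binary,
    LargeBinary,
}

// Mirrors the PR's current behavior: both Utf8 and LargeUtf8 produce a
// StringArray (Utf8), and the byte-array fallback always produces a
// BinaryArray (Binary).
fn built_type_buggy(schema_type: DataType) -> DataType {
    match schema_type {
        DataType::Utf8 | DataType::LargeUtf8 => DataType::Utf8,
        _ => DataType::Binary,
    }
}

// What RecordBatch::try_new effectively enforces: the built array's type
// must match the schema's declared type for that column.
fn try_new(schema_type: DataType, array_type: DataType) -> Result<(), String> {
    if schema_type == array_type {
        Ok(())
    } else {
        Err(format!(
            "expected {schema_type:?} but found {array_type:?} at column index 0"
        ))
    }
}

fn main() {
    // LargeUtf8 column: the buggy builder yields Utf8, so try_new fails,
    // matching the panic message observed in the repro.
    assert!(try_new(DataType::LargeUtf8, built_type_buggy(DataType::LargeUtf8)).is_err());

    // The fix: build whatever array type the schema actually declares.
    let fixed = |t: DataType| t; // e.g. LargeUtf8 -> LargeStringArray
    assert!(try_new(DataType::LargeBinary, fixed(DataType::LargeBinary)).is_ok());
    println!("repro ok");
}
```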
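For the second (encoding heuristic) finding, a safer fallback would refuse to prune whenever PLAIN appears in the chunk-level encoding list, since without PageEncodingStats we cannot tell whether PLAIN means "all pages are plain" or "dictionary encoding fell back mid-chunk". A sketch with a hypothetical Encoding enum (the real one lives in the parquet crate):

```rust
// Hypothetical stand-in for parquet's Encoding enum; NOT the real API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Encoding {
    Plain,
    PlainDictionary,
    RleDictionary,
}

// Without PageEncodingStats we only see the chunk-level encoding list.
// PLAIN in that list is ambiguous: it may describe the dictionary page
// itself, fully-plain data pages, or a mid-chunk dictionary fallback.
// A conservative check only claims full dictionary encoding when PLAIN
// is absent; it may skip pruning some legitimately dictionary-encoded
// chunks, but it never prunes rows incorrectly.
fn chunk_is_fully_dictionary(chunk_encodings: &[Encoding]) -> bool {
    let has_dictionary = chunk_encodings
        .iter()
        .any(|e| matches!(e, Encoding::PlainDictionary | Encoding::RleDictionary));
    let has_plain = chunk_encodings.contains(&Encoding::Plain);
    has_dictionary && !has_plain
}

fn main() {
    // Ambiguous: PLAIN present alongside dictionary encodings -> do not prune.
    assert!(!chunk_is_fully_dictionary(&[Encoding::PlainDictionary, Encoding::Plain]));
    // No dictionary encoding at all -> do not prune.
    assert!(!chunk_is_fully_dictionary(&[Encoding::Plain]));
    // Unambiguously dictionary-encoded -> pruning is safe.
    assert!(chunk_is_fully_dictionary(&[Encoding::RleDictionary]));
}
```

The trade-off is deliberate: failing to prune only costs performance, while pruning a chunk that contains plain fallback pages silently drops rows.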
• Public API: the PR adds ArrowPredicate::evaluate_dictionary and exports DictionaryPredicateResult from parquet/src/arrow/arrow_reader/filter.rs:25. That is a public API expansion, but it does not look breaking because the new trait method has a default implementation.
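For context on why the expansion is non-breaking, the pattern is a new trait method with a default body; the sketch below uses simplified, hypothetical signatures (the real ArrowPredicate operates on RecordBatch and boolean arrays):

```rust
// Simplified sketch; NOT the real arrow-rs signatures.
#[derive(Debug, PartialEq)]
enum DictionaryPredicateResult {
    // The predicate could not be decided from the dictionary alone.
    Unsupported,
    // Every dictionary value fails the predicate; the chunk can be pruned.
    AllFalse,
}

trait ArrowPredicate {
    fn evaluate(&mut self, values: &[i64]) -> Vec<bool>;

    // New method with a default implementation: implementors written
    // before this method existed keep compiling unchanged, which is why
    // the addition is not a breaking change.
    fn evaluate_dictionary(&mut self, _dictionary: &[i64]) -> DictionaryPredicateResult {
        DictionaryPredicateResult::Unsupported
    }
}

// A pre-existing implementor: only `evaluate` is defined, and it still compiles.
struct GreaterThanZero;
impl ArrowPredicate for GreaterThanZero {
    fn evaluate(&mut self, values: &[i64]) -> Vec<bool> {
        values.iter().map(|v| *v > 0).collect()
    }
}

// A dictionary-aware implementor can opt in to the new method.
struct AlwaysFalse;
impl ArrowPredicate for AlwaysFalse {
    fn evaluate(&mut self, values: &[i64]) -> Vec<bool> {
        vec![false; values.len()]
    }
    fn evaluate_dictionary(&mut self, _dictionary: &[i64]) -> DictionaryPredicateResult {
        DictionaryPredicateResult::AllFalse
    }
}

fn main() {
    let mut old = GreaterThanZero;
    assert_eq!(old.evaluate(&[-1, 2]), vec![false, true]);
    // Old implementors fall back to the default.
    assert_eq!(
        old.evaluate_dictionary(&[1, 2]),
        DictionaryPredicateResult::Unsupported
    );

    let mut new = AlwaysFalse;
    assert_eq!(
        new.evaluate_dictionary(&[1, 2]),
        DictionaryPredicateResult::AllFalse
    );
}
```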
• Verification
I ran cargo llvm-cov --html test -p parquet; the full parquet suite passed and the report is under target/llvm-cov/html/index.html. I also did an ablative run in a disposable worktree by reverting the PR’s implementation files and running cargo test -p parquet test_decoder_ -- --nocapture; all 13 decoder tests still passed. Finally, I added a temporary repro test in another disposable worktree and confirmed the LargeUtf8 failure above.
</p>
</details>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]