Re: [PR] fix(parquet): read full bloom filter buffers [arrow-go]

via GitHub Fri, 03 Jul 2026 11:25:59 -0700


zeroshade commented on code in PR #892:
URL: https://github.com/apache/arrow-go/pull/892#discussion_r3521384733



##########
parquet/metadata/bloom_filter.go:
##########
@@ -500,7 +500,7 @@ func (r *RowGroupBloomFilterReader) GetColumnBloomFilter(i 
int) (BloomFilter, er
                }
        }()
 
-       if _, err = sectionRdr.Read(headerBuf.Bytes()); err != nil {
+       if _, err = io.ReadFull(sectionRdr, headerBuf.Bytes()); err != nil {

Review Comment:
   `io.ReadFull` fills the whole buffer, but in the fallback path 
(`BloomFilterLength` absent) this buffer is the fixed 256-byte *probe*, not the 
exact filter size. If fewer than 256 bytes remain between the filter offset and 
EOF — e.g. a legacy/foreign file with a small trailing Bloom filter — this now 
returns `io.ErrUnexpectedEOF`, whereas the old `Read` tolerated a short probe. 
Our own writer always sets `BloomFilterLength` (line 620), so we never hit 
this, but the 256 fallback exists precisely for files that don't.
   
   Clamping the probe to the section length keeps the short-read fix while 
never demanding bytes past EOF (`io.SectionReader.Size()` == `sourceFileSize - 
offset`). `ResizeNoShrink` sets the length down to the clamp, so the downstream 
`len(headerBuf.Bytes())` checks stay correct:
   
   ```suggestion
        if sz := int(sectionRdr.Size()); sz < len(headerBuf.Bytes()) {
                headerBuf.ResizeNoShrink(sz)
        }
        if _, err = io.ReadFull(sectionRdr, headerBuf.Bytes()); err != nil {
   ```
   
   Reproducible in this PR's own harness: shrink the filter in 
`makeBloomFilterReaderTestData` to the minimum size and drop 
`BloomFilterLength` — the section is then ~50 bytes and `io.ReadFull(256)` 
fails with `ErrUnexpectedEOF` on the unpatched line.



##########
parquet/metadata/bloom_filter_reader_test.go:
##########
@@ -283,6 +341,83 @@ func TestBloomFilterRoundTrip(t *testing.T) {
        suite.Run(t, new(EncryptedBloomFilterBuilderSuite))
 }
 
+func TestBloomFilterReaderHandlesShortReads(t *testing.T) {
+       data, fileMeta, insertedHash := makeBloomFilterReaderTestData(t)
+
+       t.Run("with bloom filter length", func(t *testing.T) {
+               bf := readBloomFilterWithShortReads(t, data, fileMeta, 7)
+               assert.True(t, bf.CheckHash(insertedHash))
+       })
+
+       t.Run("without bloom filter length", func(t *testing.T) {
+               data, fileMeta, insertedHash := makeBloomFilterReaderTestData(t)
+               fileMeta.RowGroups[0].Columns[0].MetaData.BloomFilterLength = 
nil
+
+               bf := readBloomFilterWithShortReads(t, data, fileMeta, 256)
+               assert.True(t, bf.CheckHash(insertedHash))
+       })
+}
+
+func makeBloomFilterReaderTestData(t *testing.T) ([]byte, 
*metadata.FileMetaData, uint64) {
+       t.Helper()
+
+       sc := schema.NewSchema(schema.MustGroup(
+               schema.NewGroupNode("schema", parquet.Repetitions.Repeated,
+                       schema.FieldList{
+                               schema.NewByteArrayNode("c1", 
parquet.Repetitions.Optional, -1),
+                       }, -1)))
+
+       props := parquet.NewWriterProperties()
+       bldr := metadata.FileBloomFilterBuilder{Schema: sc}
+       metaBldr := metadata.NewFileMetadataBuilder(sc, props, nil)
+       rgMeta := metaBldr.AppendRowGroup()
+       filterMap := make(map[string]metadata.BloomFilterBuilder)
+       bldr.AppendRowGroup(rgMeta, filterMap)
+
+       bf := metadata.NewBloomFilter(1024, 1024, memory.NewGoAllocator())

Review Comment:
   Nice regression test. One gap: this 1024-byte filter (with `sourceFileSize 
== buf.Len()`) keeps the section ≥ 256 bytes, so the "without bloom filter 
length" subtest can't reach the `section < 256` case that the header-read 
change actually affects. A second case with a minimum-size filter and no 
`BloomFilterLength` would exercise (and guard) that fallback — it fails on the 
current diff and passes with the clamp suggested in `bloom_filter.go`.
   
   Minor, non-blocking: `shortReadAtSeeker.Read` (and the `offset` field it 
maintains) is never exercised — `Input` is `parquet.ReaderAtSeeker` (= 
`io.ReaderAt` + `io.Seeker`), and the read path goes through 
`io.SectionReader.Read → ReadAt`.



##########
parquet/metadata/bloom_filter.go:
##########
@@ -527,7 +527,7 @@ func (r *RowGroupBloomFilterReader) GetColumnBloomFilter(i 
int) (BloomFilter, er
                        copy(buf.Bytes(), headerBuf.Bytes()[headerSize:])
                }
 
-               if _, err = sectionRdr.Read(buf.Bytes()[filterBytesInHeader:]); 
err != nil {
+               if _, err = io.ReadFull(sectionRdr, 
buf.Bytes()[filterBytesInHeader:]); err != nil {

Review Comment:
   This one is spot-on and the real prize: the exact remaining bitset length is 
known here, so a short `Read` previously built a filter from a partially-read 
bitset **with no error** (silent false negatives on `CheckHash`). `io.ReadFull` 
is exactly right — no change needed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] fix(parquet): read full bloom filter buffers [arrow-go]

Reply via email to