zeroshade commented on code in PR #892:
URL: https://github.com/apache/arrow-go/pull/892#discussion_r3521384733
##########
parquet/metadata/bloom_filter.go:
##########
@@ -500,7 +500,7 @@ func (r *RowGroupBloomFilterReader) GetColumnBloomFilter(i
int) (BloomFilter, er
}
}()
- if _, err = sectionRdr.Read(headerBuf.Bytes()); err != nil {
+ if _, err = io.ReadFull(sectionRdr, headerBuf.Bytes()); err != nil {
Review Comment:
`io.ReadFull` fills the whole buffer, but in the fallback path
(`BloomFilterLength` absent) this buffer is the fixed 256-byte *probe*, not the
exact filter size. If fewer than 256 bytes remain between the filter offset and
EOF — e.g. a legacy/foreign file with a small trailing Bloom filter — this now
returns `io.ErrUnexpectedEOF`, whereas the old `Read` tolerated a short probe.
Our own writer always sets `BloomFilterLength` (line 620), so we never hit
this, but the 256 fallback exists precisely for files that don't.
Clamping the probe to the section length keeps the short-read fix while
never demanding bytes past EOF (`io.SectionReader.Size()` == `sourceFileSize -
offset`). `ResizeNoShrink` sets the length down to the clamp, so the downstream
`len(headerBuf.Bytes())` checks stay correct:
```suggestion
if sz := int(sectionRdr.Size()); sz < len(headerBuf.Bytes()) {
headerBuf.ResizeNoShrink(sz)
}
if _, err = io.ReadFull(sectionRdr, headerBuf.Bytes()); err != nil {
```
Reproducible in this PR's own harness: shrink the filter in
`makeBloomFilterReaderTestData` to the minimum size and drop
`BloomFilterLength` — the section is then ~50 bytes and `io.ReadFull(256)`
fails with `ErrUnexpectedEOF` on the unpatched line.
##########
parquet/metadata/bloom_filter_reader_test.go:
##########
@@ -283,6 +341,83 @@ func TestBloomFilterRoundTrip(t *testing.T) {
suite.Run(t, new(EncryptedBloomFilterBuilderSuite))
}
+func TestBloomFilterReaderHandlesShortReads(t *testing.T) {
+ data, fileMeta, insertedHash := makeBloomFilterReaderTestData(t)
+
+ t.Run("with bloom filter length", func(t *testing.T) {
+ bf := readBloomFilterWithShortReads(t, data, fileMeta, 7)
+ assert.True(t, bf.CheckHash(insertedHash))
+ })
+
+ t.Run("without bloom filter length", func(t *testing.T) {
+ data, fileMeta, insertedHash := makeBloomFilterReaderTestData(t)
+ fileMeta.RowGroups[0].Columns[0].MetaData.BloomFilterLength =
nil
+
+ bf := readBloomFilterWithShortReads(t, data, fileMeta, 256)
+ assert.True(t, bf.CheckHash(insertedHash))
+ })
+}
+
+func makeBloomFilterReaderTestData(t *testing.T) ([]byte,
*metadata.FileMetaData, uint64) {
+ t.Helper()
+
+ sc := schema.NewSchema(schema.MustGroup(
+ schema.NewGroupNode("schema", parquet.Repetitions.Repeated,
+ schema.FieldList{
+ schema.NewByteArrayNode("c1",
parquet.Repetitions.Optional, -1),
+ }, -1)))
+
+ props := parquet.NewWriterProperties()
+ bldr := metadata.FileBloomFilterBuilder{Schema: sc}
+ metaBldr := metadata.NewFileMetadataBuilder(sc, props, nil)
+ rgMeta := metaBldr.AppendRowGroup()
+ filterMap := make(map[string]metadata.BloomFilterBuilder)
+ bldr.AppendRowGroup(rgMeta, filterMap)
+
+ bf := metadata.NewBloomFilter(1024, 1024, memory.NewGoAllocator())
Review Comment:
Nice regression test. One gap: this 1024-byte filter (with `sourceFileSize
== buf.Len()`) keeps the section ≥ 256 bytes, so the "without bloom filter
length" subtest can't reach the `section < 256` case that the header-read
change actually affects. A second case with a minimum-size filter and no
`BloomFilterLength` would exercise (and guard) that fallback — it fails on the
current diff and passes with the clamp suggested in `bloom_filter.go`.
Minor, non-blocking: `shortReadAtSeeker.Read` (and the `offset` field it
maintains) is never exercised — `Input` is `parquet.ReaderAtSeeker` (=
`io.ReaderAt` + `io.Seeker`), and the read path goes through
`io.SectionReader.Read → ReadAt`.
##########
parquet/metadata/bloom_filter.go:
##########
@@ -527,7 +527,7 @@ func (r *RowGroupBloomFilterReader) GetColumnBloomFilter(i
int) (BloomFilter, er
copy(buf.Bytes(), headerBuf.Bytes()[headerSize:])
}
- if _, err = sectionRdr.Read(buf.Bytes()[filterBytesInHeader:]);
err != nil {
+ if _, err = io.ReadFull(sectionRdr,
buf.Bytes()[filterBytesInHeader:]); err != nil {
Review Comment:
This one is spot-on and the real prize: the exact remaining bitset length is
known here, so a short `Read` previously built a filter from a partially-read
bitset **with no error** (silent false negatives on `CheckHash`). `io.ReadFull`
is exactly right — no change needed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]