zeroshade commented on code in PR #1127:
URL: https://github.com/apache/iceberg-go/pull/1127#discussion_r3523596775
##########
table/dv/deletion_vector.go:
##########
@@ -179,21 +173,36 @@ func ReadDV(fs iceio.IO, dvFile iceberg.DataFile)
(*RoaringPositionBitmap, error
return nil, fmt.Errorf("read DV blob at offset %d: %w", offset,
err)
}
- cardinality, ok, err := blobCardinality(reader.Blobs(), offset, size)
+ blobCardinality, hasBlobCardinality, err :=
blobCardinality(reader.Blobs(), offset, size)
if err != nil {
return nil, fmt.Errorf("DV file %s: %w", dvFile.FilePath(), err)
}
- if !ok {
+ manifestCardinality := dvFile.Count()
+ hasManifestCardinality := manifestCardinality > 0
+
+ var expectedCardinality int64
+ switch {
+ case hasManifestCardinality && hasBlobCardinality:
+ if manifestCardinality != blobCardinality {
+ return nil, fmt.Errorf("manifest record_count %d
disagrees with puffin cardinality property %d",
Review Comment:
Two polish items while updating this block: line 176 shadows the
package-level `blobCardinality` helper, so renaming the local to something like
`puffinCardinality` would avoid confusion; and this mismatch error should
include the DV file path, consistent with the neighboring `ReadDV` errors, to
make corrupt-file diagnostics actionable.
##########
table/dv/deletion_vector.go:
##########
@@ -179,21 +173,36 @@ func ReadDV(fs iceio.IO, dvFile iceberg.DataFile)
(*RoaringPositionBitmap, error
return nil, fmt.Errorf("read DV blob at offset %d: %w", offset,
err)
}
- cardinality, ok, err := blobCardinality(reader.Blobs(), offset, size)
+ blobCardinality, hasBlobCardinality, err :=
blobCardinality(reader.Blobs(), offset, size)
if err != nil {
return nil, fmt.Errorf("DV file %s: %w", dvFile.FilePath(), err)
}
- if !ok {
+ manifestCardinality := dvFile.Count()
+ hasManifestCardinality := manifestCardinality > 0
Review Comment:
This treats `record_count == 0` as if the manifest cardinality were absent.
For an `iceberg.DataFile`, manifest `record_count` is a required non-nullable
long, so zero is a valid expected cardinality and still needs to participate in
both checks. As written, manifest 0 + Puffin cardinality 5 + decoded bitmap
cardinality 5 falls into the blob-only path and is accepted instead of
rejected. Please treat the manifest cardinality as present for data files,
compare it to the Puffin cardinality whenever the property exists, and validate
the decoded bitmap against it, including zero. A regression test for manifest 0
/ Puffin 5 / bitmap 5 should error.
##########
table/dv/deletion_vector_test.go:
##########
@@ -413,18 +413,29 @@ func TestReadDVCardinalityValidation(t *testing.T) {
assert.Equal(t, int64(5), bm.Cardinality())
})
- t.Run("mismatched cardinality is rejected", func(t *testing.T) {
+ t.Run("manifest and puffin cardinality disagreement is rejected",
func(t *testing.T) {
dir := t.TempDir()
path, meta := writePuffinWithDVBlobAndProps(t, dir,
dvBlobBytes, map[string]string{
"referenced-data-file":
"s3://bucket/data/data-001.parquet",
- // Bitmap actually has 5 positions; claim 99.
+ // Manifest record_count below says 5; claim 99 in the
Puffin property.
"cardinality": "99",
})
offset, size := meta.Offset, meta.Length
_, err := ReadDV(iceio.LocalFS{}, newDVTestFile(path, 5,
&offset, &size))
require.Error(t, err)
- // Pin the specific error path: DeserializeDV's "cardinality
- // mismatch", not the helper's "invalid cardinality" parse
error.
+ assert.Contains(t, err.Error(), "manifest record_count 5
disagrees with puffin cardinality property 99")
+ })
+
+ t.Run("matching manifest and puffin cardinality still validates
bitmap", func(t *testing.T) {
+ dir := t.TempDir()
+ path, meta := writePuffinWithDVBlobAndProps(t, dir,
dvBlobBytes, map[string]string{
+ "referenced-data-file":
"s3://bucket/data/data-001.parquet",
+ // Both metadata sources agree, but the bitmap actually
has 5 positions.
+ "cardinality": "99",
+ })
+ offset, size := meta.Offset, meta.Length
+ _, err := ReadDV(iceio.LocalFS{}, newDVTestFile(path, 99,
&offset, &size))
+ require.Error(t, err)
assert.Contains(t, err.Error(), "cardinality mismatch")
Review Comment:
Nearby real line 442 still claims to cover the missing-cardinality-property
path, but it calls `writePuffinWithDVBlob`, which writes `cardinality: 5`, so
the test exercises the normal cardinality path instead of missing-property
tolerance. Please add a fixture/helper that actually omits the cardinality
property, and include a zero-cardinality case so the manifest-zero behavior is
covered explicitly.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]