laskoviymishka commented on code in PR #1127:
URL: https://github.com/apache/iceberg-go/pull/1127#discussion_r3302133929
##########
table/dv/deletion_vector.go:
##########
@@ -179,21 +173,36 @@ func ReadDV(fs iceio.IO, dvFile iceberg.DataFile)
(*RoaringPositionBitmap, error
return nil, fmt.Errorf("read DV blob at offset %d: %w", offset,
err)
}
- cardinality, ok, err := blobCardinality(reader.Blobs(), offset, size)
+ blobCardinality, hasBlobCardinality, err :=
blobCardinality(reader.Blobs(), offset, size)
if err != nil {
return nil, fmt.Errorf("DV file %s: %w", dvFile.FilePath(), err)
}
- if !ok {
+ manifestCardinality := dvFile.Count()
+ hasManifestCardinality := manifestCardinality > 0
Review Comment:
I think there's a subtle issue here. `record_count` is field 103 in the spec
— non-nullable `long`, no absent state — and Java writes `0` for empty DVs
(BaseDVFileWriter via `deletes.positions().cardinality()`). Treating
`manifestCardinality > 0` as "manifest didn't tell us" means a real
disagreement like blob=5 / manifest=0 falls through to "blob-only" and gets
accepted silently, which is exactly the case the PR is trying to catch.
Since `RecordCount` is a non-pointer `int64` with no representable absent
state, I'd just treat it as always present — `hasManifestCardinality := true` —
and let the agree/disagree branches handle zero naturally. wdyt?
##########
table/dv/deletion_vector.go:
##########
@@ -179,21 +173,36 @@ func ReadDV(fs iceio.IO, dvFile iceberg.DataFile)
(*RoaringPositionBitmap, error
return nil, fmt.Errorf("read DV blob at offset %d: %w", offset,
err)
}
- cardinality, ok, err := blobCardinality(reader.Blobs(), offset, size)
+ blobCardinality, hasBlobCardinality, err :=
blobCardinality(reader.Blobs(), offset, size)
Review Comment:
The local `blobCardinality` shadows the package-level function of the same
name on this line — readability hazard since the identifier resolves to the
int64 result for the rest of the function, including the three uses below
(`manifestCardinality != blobCardinality`, the `Errorf` arg, and
`expectedCardinality = blobCardinality` in the `case hasBlobCardinality:` arm).
I'd rename the local to `puffinCard` (mirrors the `manifestCardinality`
source-prefix naming) and touch up all 4 sites — declaration plus the three
uses inside the switch.
##########
table/dv/deletion_vector.go:
##########
@@ -179,21 +173,36 @@ func ReadDV(fs iceio.IO, dvFile iceberg.DataFile)
(*RoaringPositionBitmap, error
return nil, fmt.Errorf("read DV blob at offset %d: %w", offset,
err)
}
- cardinality, ok, err := blobCardinality(reader.Blobs(), offset, size)
+ blobCardinality, hasBlobCardinality, err :=
blobCardinality(reader.Blobs(), offset, size)
if err != nil {
return nil, fmt.Errorf("DV file %s: %w", dvFile.FilePath(), err)
}
- if !ok {
+ manifestCardinality := dvFile.Count()
+ hasManifestCardinality := manifestCardinality > 0
+
+ var expectedCardinality int64
+ switch {
+ case hasManifestCardinality && hasBlobCardinality:
+ if manifestCardinality != blobCardinality {
+ return nil, fmt.Errorf("manifest record_count %d
disagrees with puffin cardinality property %d",
Review Comment:
Every other error in `ReadDV` carries the DV file path (the `"DV file %s:
%w"` wrap above, plus `read DV blob at offset %d`, etc.); this one's the odd
one out. Cheap to include and it makes triage from a server log a lot easier:
```go
return nil, fmt.Errorf("DV file %s: manifest record_count %d disagrees with
puffin cardinality property %d",
dvFile.FilePath(), manifestCardinality, blobCardinality)
```
##########
table/dv/deletion_vector_test.go:
##########
@@ -413,18 +413,29 @@ func TestReadDVCardinalityValidation(t *testing.T) {
assert.Equal(t, int64(5), bm.Cardinality())
})
- t.Run("mismatched cardinality is rejected", func(t *testing.T) {
+ t.Run("manifest and puffin cardinality disagreement is rejected",
func(t *testing.T) {
dir := t.TempDir()
path, meta := writePuffinWithDVBlobAndProps(t, dir,
dvBlobBytes, map[string]string{
"referenced-data-file":
"s3://bucket/data/data-001.parquet",
- // Bitmap actually has 5 positions; claim 99.
+ // Manifest record_count below says 5; claim 99 in the
Puffin property.
"cardinality": "99",
})
offset, size := meta.Offset, meta.Length
_, err := ReadDV(iceio.LocalFS{}, newDVTestFile(path, 5,
&offset, &size))
require.Error(t, err)
- // Pin the specific error path: DeserializeDV's "cardinality
- // mismatch", not the helper's "invalid cardinality" parse
error.
+ assert.Contains(t, err.Error(), "manifest record_count 5
disagrees with puffin cardinality property 99")
+ })
Review Comment:
Related to this PR but in the sibling subtest just below this one: the
existing `"missing cardinality property is tolerated with a warning"` test
doesn't reach the `default` branch anymore — `writePuffinWithDVBlob` still
embeds `"cardinality": "5"` and the test passes `count=5`, so it hits the
agree-path. The warning branch could be deleted and that test would keep
passing.
I'd add a sibling helper (or extend `writePuffinWithDVBlobAndProps`) that
omits the cardinality property entirely, paired with a manifest `count` that
also can't be used, so the test actually pins the `default` warning.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]