laskoviymishka commented on code in PR #1127:
URL: https://github.com/apache/iceberg-go/pull/1127#discussion_r3302133929


##########
table/dv/deletion_vector.go:
##########
@@ -179,21 +173,36 @@ func ReadDV(fs iceio.IO, dvFile iceberg.DataFile) 
(*RoaringPositionBitmap, error
                return nil, fmt.Errorf("read DV blob at offset %d: %w", offset, 
err)
        }
 
-       cardinality, ok, err := blobCardinality(reader.Blobs(), offset, size)
+       blobCardinality, hasBlobCardinality, err := 
blobCardinality(reader.Blobs(), offset, size)
        if err != nil {
                return nil, fmt.Errorf("DV file %s: %w", dvFile.FilePath(), err)
        }
-       if !ok {
+       manifestCardinality := dvFile.Count()
+       hasManifestCardinality := manifestCardinality > 0

Review Comment:
   I think there's a subtle issue here. `record_count` is field 103 in the spec 
— non-nullable `long`, no absent state — and Java writes `0` for empty DVs 
(BaseDVFileWriter via `deletes.positions().cardinality()`). Treating 
`manifestCardinality > 0` as "manifest didn't tell us" means a real 
disagreement like blob=5 / manifest=0 falls through to "blob-only" and gets 
accepted silently, which is exactly the case the PR is trying to catch.
   
   Since `RecordCount` is a non-pointer `int64` with no representable absent 
state, I'd just treat it as always present — `hasManifestCardinality := true` — 
and let the agree/disagree branches handle zero naturally. wdyt?



##########
table/dv/deletion_vector.go:
##########
@@ -179,21 +173,36 @@ func ReadDV(fs iceio.IO, dvFile iceberg.DataFile) 
(*RoaringPositionBitmap, error
                return nil, fmt.Errorf("read DV blob at offset %d: %w", offset, 
err)
        }
 
-       cardinality, ok, err := blobCardinality(reader.Blobs(), offset, size)
+       blobCardinality, hasBlobCardinality, err := 
blobCardinality(reader.Blobs(), offset, size)

Review Comment:
   The local `blobCardinality` shadows the package-level function of the same 
name on this line — readability hazard since the identifier resolves to the 
int64 result for the rest of the function, including the three uses below 
(`manifestCardinality != blobCardinality`, the `Errorf` arg, and 
`expectedCardinality = blobCardinality` in the `case hasBlobCardinality:` arm).
   
   I'd rename the local to `puffinCard` (mirrors the `manifestCardinality` 
source-prefix naming) and touch up all 4 sites — declaration plus the three 
uses inside the switch.



##########
table/dv/deletion_vector.go:
##########
@@ -179,21 +173,36 @@ func ReadDV(fs iceio.IO, dvFile iceberg.DataFile) 
(*RoaringPositionBitmap, error
                return nil, fmt.Errorf("read DV blob at offset %d: %w", offset, 
err)
        }
 
-       cardinality, ok, err := blobCardinality(reader.Blobs(), offset, size)
+       blobCardinality, hasBlobCardinality, err := 
blobCardinality(reader.Blobs(), offset, size)
        if err != nil {
                return nil, fmt.Errorf("DV file %s: %w", dvFile.FilePath(), err)
        }
-       if !ok {
+       manifestCardinality := dvFile.Count()
+       hasManifestCardinality := manifestCardinality > 0
+
+       var expectedCardinality int64
+       switch {
+       case hasManifestCardinality && hasBlobCardinality:
+               if manifestCardinality != blobCardinality {
+                       return nil, fmt.Errorf("manifest record_count %d 
disagrees with puffin cardinality property %d",

Review Comment:
   Every other error in `ReadDV` carries the DV file path (the `"DV file %s: 
%w"` wrap above, plus `read DV blob at offset %d`, etc.); this one's the odd 
one out. Cheap to include and it makes triage from a server log a lot easier:
   
   ```go
   return nil, fmt.Errorf("DV file %s: manifest record_count %d disagrees with 
puffin cardinality property %d",
       dvFile.FilePath(), manifestCardinality, blobCardinality)
   ```



##########
table/dv/deletion_vector_test.go:
##########
@@ -413,18 +413,29 @@ func TestReadDVCardinalityValidation(t *testing.T) {
                assert.Equal(t, int64(5), bm.Cardinality())
        })
 
-       t.Run("mismatched cardinality is rejected", func(t *testing.T) {
+       t.Run("manifest and puffin cardinality disagreement is rejected", 
func(t *testing.T) {
                dir := t.TempDir()
                path, meta := writePuffinWithDVBlobAndProps(t, dir, 
dvBlobBytes, map[string]string{
                        "referenced-data-file": 
"s3://bucket/data/data-001.parquet",
-                       // Bitmap actually has 5 positions; claim 99.
+                       // Manifest record_count below says 5; claim 99 in the 
Puffin property.
                        "cardinality": "99",
                })
                offset, size := meta.Offset, meta.Length
                _, err := ReadDV(iceio.LocalFS{}, newDVTestFile(path, 5, 
&offset, &size))
                require.Error(t, err)
-               // Pin the specific error path: DeserializeDV's "cardinality
-               // mismatch", not the helper's "invalid cardinality" parse 
error.
+               assert.Contains(t, err.Error(), "manifest record_count 5 
disagrees with puffin cardinality property 99")
+       })

Review Comment:
   Related to this PR but in the sibling subtest just below this one: the 
existing `"missing cardinality property is tolerated with a warning"` test 
doesn't reach the `default` branch anymore — `writePuffinWithDVBlob` still 
embeds `"cardinality": "5"` and the test passes `count=5`, so it hits the 
agree-path. The warning branch could be deleted and that test would keep 
passing.
   
   I'd add a sibling helper (or extend `writePuffinWithDVBlobAndProps`) that 
omits the cardinality property entirely, paired with a manifest `count` that 
also can't be used, so the test actually pins the `default` warning.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to