This is an automated email from the ASF dual-hosted git repository.

zeroshade pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/iceberg-go.git


The following commit(s) were added to refs/heads/main by this push:
     new 261edba7 test(puffin): golden round-trip a deletion-vector-v1 blob (#1041)
261edba7 is described below

commit 261edba7b4afe680473995cb6db12e85a69048c7
Author: Andrei Tserakhau <[email protected]>
AuthorDate: Tue May 12 18:35:09 2026 +0200

    test(puffin): golden round-trip a deletion-vector-v1 blob (#1041)
    
    Pins the puffin envelope shape around a deletion-vector-v1 blob without
    depending on the in-flight roaring decoder (#866). The fixture is two
    files:
    
    - puffin/testdata/deletion-vector-v1-payload.bin — a Java-produced
    64-bit roaring DV payload lifted directly from apache/iceberg test
    resources (small-alternating-values-position-index.bin, 50 bytes; bitmap
    encodes positions 1, 3, 5, 7, 9).
    
    - puffin/testdata/deletion-vector-v1.puffin — the same payload wrapped
    by puffin.Writer with the spec-canonical metadata (snapshot-id=-1,
    sequence-number=-1, deletion-vector-v1 blob type, referenced-data-file
    and cardinality properties).
    
    The test asserts both layers. Reader side: blob count, type,
    spec-mandated invariants, properties, and that ReadBlob round-trips the
    inner payload byte-for-byte equal to the standalone Java fixture. Writer
    side: regenerates the envelope in-memory and asserts byte equality
    against the on-disk fixture. Without the writer-side check the test would
    be a reader-only round-trip, and writer drift would silently calcify into
    the next regeneration.
    
    Honest framing: this is a Go-writer wire-format pin with a Java-
    equivalent inner payload, not a strong Java cross-impl pin. The basic
    envelope shape is cross-checked by TestWriterBitIdenticalWithJava, but
    that test does not exercise empty Fields arrays or multi-key blob
    Properties — both of which this fixture relies on — and JSON key
    ordering of blob Properties is encoder-defined.
    
    A regen test gated on REGEN_FIXTURES=1 reproduces the .puffin from the
    inner payload and self-validates by reading the blob back before
    overwriting the on-disk file.
    
    Closes #1008.
---
 puffin/dv_golden_test.go                       | 167 +++++++++++++++++++++++++
 puffin/gen_dv_fixture.go                       | 110 ++++++++++++++++
 puffin/testdata/README.md                      |  48 ++++++-
 puffin/testdata/deletion-vector-v1-payload.bin | Bin 0 -> 50 bytes
 puffin/testdata/deletion-vector-v1.puffin      | Bin 0 -> 314 bytes
 5 files changed, 324 insertions(+), 1 deletion(-)

diff --git a/puffin/dv_golden_test.go b/puffin/dv_golden_test.go
new file mode 100644
index 00000000..2464b582
--- /dev/null
+++ b/puffin/dv_golden_test.go
@@ -0,0 +1,167 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package puffin_test
+
+//go:generate go run gen_dv_fixture.go
+
+import (
+       "bytes"
+       "os"
+       "path/filepath"
+       "testing"
+
+       "github.com/apache/iceberg-go/puffin"
+       "github.com/stretchr/testify/assert"
+       "github.com/stretchr/testify/require"
+)
+
+// dvFixturePayloadName is the standalone Java-produced 64-bit roaring DV
+// payload lifted from apache/iceberg's test resources. The inner shape per
+// Iceberg spec: 4-byte BE length, 4-byte 0xD1D33964 magic, serialized
+// roaring bitmap, 4-byte BE CRC32. Source:
+// https://github.com/apache/iceberg/blob/main/core/src/test/resources/org/apache/iceberg/deletes/small-alternating-values-position-index.bin
+const dvFixturePayloadName = "deletion-vector-v1-payload.bin"
+
+// dvFixturePuffinName is the complete puffin file wrapping the Java-produced
+// DV payload. The envelope is what puffin.Writer emits today; this is a
+// Go-writer wire-format pin, not a strong Java cross-impl pin. (The basic
+// puffin envelope shape is cross-checked by TestWriterBitIdenticalWithJava,
+// but that test does not exercise empty Fields arrays or multi-key blob
+// Properties, both of which this fixture relies on. JSON key ordering of
+// blob Properties is also encoder-defined.) Regenerate via
+// `go generate ./puffin/...` (driven by gen_dv_fixture.go).
+const dvFixturePuffinName = "deletion-vector-v1.puffin"
+
+// dvFixtureReferencedDataFile is the placeholder data-file path stored in
+// the blob's properties. Spec: every DV blob carries `referenced-data-file`
+// pointing at the parquet file it deletes from. The exact string is a fixture
+// choice; matching against any specific Java-emitted file would require a
+// matching string, which is not currently checked in upstream.
+const dvFixtureReferencedDataFile = "data/test.parquet"
+
+// dvFixtureCardinality is the cardinality property — the count of deleted
+// row positions encoded inside the roaring bitmap. String form because puffin
+// blob properties are stringly-typed (map[string]string). The bitmap encodes
+// 5 positions: 1, 3, 5, 7, 9.
+const dvFixtureCardinality = "5"
+
+// buildDVFixture returns a puffin envelope wrapping the given Java-produced
+// DV payload, with the canonical metadata for a deletion-vector-v1 blob.
+// The same builder is used by both the regen path and the wire-format pin,
+// so any drift in puffin.Writer surfaces as a byte-mismatch against the
+// checked-in fixture rather than calcifying into a regenerated golden.
+func buildDVFixture(t *testing.T, payload []byte) []byte {
+       t.Helper()
+       buf := &bytes.Buffer{}
+       w, err := puffin.NewWriter(buf)
+       require.NoError(t, err)
+       require.NoError(t, w.SetCreatedBy("iceberg-go test fixture"))
+
+       _, err = w.AddBlob(puffin.BlobMetadataInput{
+               Type:           puffin.BlobTypeDeletionVector,
+               SnapshotID:     -1,
+               SequenceNumber: -1,
+               Fields:         []int32{},
+               Properties: map[string]string{
+                       "referenced-data-file": dvFixtureReferencedDataFile,
+                       "cardinality":          dvFixtureCardinality,
+               },
+       }, payload)
+       require.NoError(t, err)
+       require.NoError(t, w.Finish())
+
+       return buf.Bytes()
+}
+
+// TestDeletionVectorPuffinWireFormat is a Go-writer wire-format pin for
+// puffin envelopes wrapping deletion-vector-v1 blobs, with a Java-produced
+// inner payload. Two layers of guarantee:
+//
+//   - The inner roaring payload is byte-equal to a Java-produced fixture
+//     lifted directly from apache/iceberg test resources. If the puffin
+//     reader ever mangles blob bytes on the way out, this fails.
+//
+//   - The on-disk envelope bytes are byte-equal to what puffin.Writer
+//     re-emits today for the same input. Any drift in the writer (footer
+//     JSON shape, key ordering, properties handling, magic placement)
+//     surfaces here instead of calcifying silently into the next regen.
+//
+// Independent of #866 (the roaring decoder PR) — uses raw bytes only.
+func TestDeletionVectorPuffinWireFormat(t *testing.T) {
+       puffinBytes, err := os.ReadFile(filepath.Join("testdata", dvFixturePuffinName))
+       require.NoError(t, err, "fixture missing — regenerate with `go generate ./puffin/...`")
+
+       expectedPayload, err := os.ReadFile(filepath.Join("testdata", dvFixturePayloadName))
+       require.NoError(t, err)
+
+       // Writer-side pin: the checked-in envelope must equal what puffin.Writer
+       // produces today for the same input. Without this, writer regressions
+       // slip through the read-side assertions because both sides drift
+       // together.
+       freshBytes := buildDVFixture(t, expectedPayload)
+       if !bytes.Equal(freshBytes, puffinBytes) {
+               diffAt := -1
+               for i := 0; i < len(freshBytes) && i < len(puffinBytes); i++ {
+                       if freshBytes[i] != puffinBytes[i] {
+                               diffAt = i
+
+                               break
+                       }
+               }
+               t.Fatalf("checked-in envelope no longer matches puffin.Writer output. "+
+                       "First diff at byte %d (fixture=%d bytes, fresh=%d bytes). "+
+                       "Either a deliberate format change (regenerate with "+
+                       "`go generate ./puffin/...` and review the diff) or a writer regression.",
+                       diffAt, len(puffinBytes), len(freshBytes))
+       }
+
+       // Parse the envelope, which covers the magic bytes at file head and tail.
+       r, err := puffin.NewReader(bytes.NewReader(puffinBytes))
+       require.NoError(t, err)
+
+       // Read-side pin: blob count, type, spec-mandated invariants.
+       blobs := r.Blobs()
+       require.Len(t, blobs, 1, "fixture should contain exactly one DV blob")
+
+       blob := blobs[0]
+       assert.Equal(t, puffin.BlobTypeDeletionVector, blob.Type)
+       assert.Equal(t, int64(-1), blob.SnapshotID,
+               "deletion-vector-v1 spec requires snapshot-id=-1")
+       assert.Equal(t, int64(-1), blob.SequenceNumber,
+               "deletion-vector-v1 spec requires sequence-number=-1")
+       // Strict empty (not nil): Java's parser rejects "fields": null. Asserting
+       // the concrete []int32{} value catches a future regression that would
+       // emit null instead of [].
+       assert.Equal(t, []int32{}, blob.Fields,
+               "DV blob fields must be an explicit empty array per spec")
+       assert.Nil(t, blob.CompressionCodec,
+               "this fixture is uncompressed (per fixture choice, not spec)")
+       assert.Len(t, blob.Properties, 2,
+               "DV blob should carry exactly the two spec-canonical properties; "+
+                       "a writer regression that emits extra keys must fail here too")
+       assert.Equal(t, dvFixtureReferencedDataFile, blob.Properties["referenced-data-file"])
+       assert.Equal(t, dvFixtureCardinality, blob.Properties["cardinality"])
+       assert.Equal(t, int64(len(expectedPayload)), blob.Length,
+               "blob length should equal the Java-produced payload")
+
+       // Reader returns the inner payload byte-for-byte equal to the Java fixture.
+       got, err := r.ReadBlob(0)
+       require.NoError(t, err)
+       assert.Equal(t, expectedPayload, got.Data,
+               "reader should round-trip the Java-produced payload bytes unmodified")
+}
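The first-diff scan in TestDeletionVectorPuffinWireFormat leaves diffAt at -1 when one buffer is a strict prefix of the other, so a pure length mismatch would be reported as "First diff at byte -1". A minimal sketch of a helper that also covers that case follows. The name `firstDiff` is hypothetical; it is not part of this commit or of the puffin package.

```go
package main

import "fmt"

// firstDiff (hypothetical helper) returns the index of the first byte at
// which a and b differ, or -1 when they are identical. When one slice is a
// strict prefix of the other, the first difference is the length of the
// shorter slice.
func firstDiff(a, b []byte) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i] != b[i] {
			return i
		}
	}
	if len(a) != len(b) {
		return n
	}

	return -1
}

func main() {
	fmt.Println(firstDiff([]byte("puffin"), []byte("puffer"))) // 4
	fmt.Println(firstDiff([]byte("abc"), []byte("abcd")))      // 3
	fmt.Println(firstDiff([]byte("abc"), []byte("abc")))       // -1
}
```

With a helper of this shape, the failure message in the test could report a meaningful offset for truncated or extended envelopes as well as for in-place byte changes.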
diff --git a/puffin/gen_dv_fixture.go b/puffin/gen_dv_fixture.go
new file mode 100644
index 00000000..c69e2e1f
--- /dev/null
+++ b/puffin/gen_dv_fixture.go
@@ -0,0 +1,110 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//go:build ignore
+
+// Regenerates puffin/testdata/deletion-vector-v1.puffin from the Java-
+// produced inner payload (deletion-vector-v1-payload.bin). Invoked via
+//
+//     go generate ./puffin/...
+//
+// After regen, diff the file before committing to confirm only the
+// intended bytes changed:
+//
+//     git diff -- puffin/testdata/deletion-vector-v1.puffin
+package main
+
+import (
+       "bytes"
+       "fmt"
+       "log"
+       "os"
+       "path/filepath"
+
+       "github.com/apache/iceberg-go/puffin"
+)
+
+const (
+       payloadName        = "deletion-vector-v1-payload.bin"
+       puffinName         = "deletion-vector-v1.puffin"
+       referencedDataFile = "data/test.parquet"
+       cardinality        = "5"
+)
+
+func main() {
+       // go generate runs in the directory of the file carrying the
+       // //go:generate directive (puffin/); testdata sits alongside.
+       payloadPath := filepath.Join("testdata", payloadName)
+       outPath := filepath.Join("testdata", puffinName)
+
+       payload, err := os.ReadFile(payloadPath)
+       if err != nil {
+               log.Fatalf("read payload: %v", err)
+       }
+
+       buf := &bytes.Buffer{}
+       w, err := puffin.NewWriter(buf)
+       if err != nil {
+               log.Fatalf("new writer: %v", err)
+       }
+       if err := w.SetCreatedBy("iceberg-go test fixture"); err != nil {
+               log.Fatalf("set created-by: %v", err)
+       }
+       if _, err := w.AddBlob(puffin.BlobMetadataInput{
+               Type:           puffin.BlobTypeDeletionVector,
+               SnapshotID:     -1,
+               SequenceNumber: -1,
+               Fields:         []int32{},
+               Properties: map[string]string{
+                       "referenced-data-file": referencedDataFile,
+                       "cardinality":          cardinality,
+               },
+       }, payload); err != nil {
+               log.Fatalf("add blob: %v", err)
+       }
+       if err := w.Finish(); err != nil {
+               log.Fatalf("finish: %v", err)
+       }
+
+       // Self-validate before overwriting: parse what we just produced, read
+       // the blob back, and confirm both the envelope-level invariants and
+       // the inner-payload bytes survive the round-trip. Without the ReadBlob
+       // step a writer bug producing valid-but-mismatched blob offsets/lengths
+       // would still pass parse-only validation and calcify into the fixture.
+       r, err := puffin.NewReader(bytes.NewReader(buf.Bytes()))
+       if err != nil {
+               log.Fatalf("regen produced an unreadable puffin file: %v", err)
+       }
+       if n := len(r.Blobs()); n != 1 {
+               log.Fatalf("regen produced %d blobs, want 1", n)
+       }
+       if got, want := r.Blobs()[0].Type, puffin.BlobTypeDeletionVector; got != want {
+               log.Fatalf("regen produced blob type %q, want %q", got, want)
+       }
+       got, err := r.ReadBlob(0)
+       if err != nil {
+               log.Fatalf("regen produced an unreadable blob: %v", err)
+       }
+       if !bytes.Equal(payload, got.Data) {
+               log.Fatal("regen round-trip mangled the inner payload")
+       }
+
+       if err := os.WriteFile(outPath, buf.Bytes(), 0o644); err != nil {
+               log.Fatalf("write fixture: %v", err)
+       }
+       fmt.Printf("wrote %s (%d bytes)\n", outPath, buf.Len())
+}
diff --git a/puffin/testdata/README.md b/puffin/testdata/README.md
index 084df93c..b39e5d62 100644
--- a/puffin/testdata/README.md
+++ b/puffin/testdata/README.md
@@ -17,5 +17,51 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-These test fixture files are canonical Puffin files from the Apache Iceberg Java implementation:
+## Canonical fixtures from apache/iceberg
+
+`empty-puffin-uncompressed.bin`, `sample-metric-data-uncompressed.bin`, and
+`sample-metric-data-compressed-zstd.bin` are canonical Puffin files from the
+Apache Iceberg Java implementation:
 https://github.com/apache/iceberg/tree/main/core/src/test/resources/org/apache/iceberg/puffin/v1
+
+## Deletion-vector cross-impl fixtures
+
+`deletion-vector-v1-payload.bin` is a Java-produced 64-bit Roaring deletion
+vector payload lifted directly from apache/iceberg's test resources. 50 bytes
+total: 4-byte BE length, 4-byte 0xD1D33964 magic, serialized roaring bitmap
+(38 bytes), 4-byte BE CRC32. The bitmap encodes 5 deleted positions
+(1, 3, 5, 7, 9). Source:
+https://github.com/apache/iceberg/blob/main/core/src/test/resources/org/apache/iceberg/deletes/small-alternating-values-position-index.bin
+
+`deletion-vector-v1.puffin` wraps that payload in a complete Puffin envelope:
+blob type `deletion-vector-v1`, snapshot-id and sequence-number set to -1
+per spec, with `referenced-data-file` and `cardinality` properties. The
+envelope is what `puffin.Writer` emits today; this is a Go-writer wire-
+format pin, not a strong Java cross-impl pin. The basic envelope shape is
+cross-checked by `TestWriterBitIdenticalWithJava`, but that test does not
+exercise empty `Fields` arrays or multi-key blob `Properties` — both of
+which this fixture relies on — and JSON key ordering of blob `Properties`
+is encoder-defined. The property values
+(`referenced-data-file=data/test.parquet`, `cardinality=5`,
+`created-by="iceberg-go test fixture"`) are fixture choices, not bytes
+inherited from any specific Java-emitted file.
+
+To regenerate after a deliberate puffin-format change:
+
+```
+go generate ./puffin/...
+```
+
+The generator lives in `puffin/gen_dv_fixture.go` (built with the
+`//go:build ignore` tag and run via the `//go:generate` directive in
+`puffin/dv_golden_test.go`). It self-validates by reading the freshly-
+written envelope back before overwriting the on-disk file, so a writer
+bug producing a valid-but-unreadable file fails the regen rather than
+calcifying into the fixture.
+
+After regen, diff the file before committing to verify only intended bytes
+changed:
+
+```
+git diff -- puffin/testdata/deletion-vector-v1.puffin
+```
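The README's remark that JSON key ordering of blob Properties is encoder-defined can be illustrated with Go's standard encoder, which sorts map keys when marshaling. This is only a sketch: whether puffin.Writer serializes Properties through encoding/json is an assumption here, and `marshalProps` is a hypothetical helper, not part of the package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// marshalProps (hypothetical helper) encodes a string map with
// encoding/json, which emits map keys in sorted order regardless of
// insertion or iteration order.
func marshalProps(props map[string]string) (string, error) {
	b, err := json.Marshal(props)

	return string(b), err
}

func main() {
	out, err := marshalProps(map[string]string{
		"referenced-data-file": "data/test.parquet",
		"cardinality":          "5",
	})
	if err != nil {
		panic(err)
	}
	// prints {"cardinality":"5","referenced-data-file":"data/test.parquet"}
	fmt.Println(out)
}
```

An encoder in another implementation may keep insertion order instead, which is one reason byte equality against this fixture only pins the Go writer.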
diff --git a/puffin/testdata/deletion-vector-v1-payload.bin b/puffin/testdata/deletion-vector-v1-payload.bin
new file mode 100644
index 00000000..80829fae
Binary files /dev/null and b/puffin/testdata/deletion-vector-v1-payload.bin differ
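The payload framing described in the README hunk (4-byte BE length, 4-byte 0xD1D33964 magic, serialized roaring bitmap, 4-byte BE CRC32) can be checked without a roaring decoder. The sketch below assumes that the length field and the CRC both cover the magic plus the bitmap bytes, and that the CRC uses the IEEE polynomial. That is my reading of the Iceberg spec and is not something this commit verifies. `frameDVPayload` and `splitDVPayload` are hypothetical helpers.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// dvMagic is the 4-byte magic sequence named in the fixture description.
var dvMagic = []byte{0xD1, 0xD3, 0x39, 0x64}

// frameDVPayload (hypothetical helper) wraps serialized bitmap bytes as
// length | magic | bitmap | CRC32, with both integers big-endian.
// Assumption: length and CRC both cover magic + bitmap.
func frameDVPayload(bitmap []byte) []byte {
	body := append(append([]byte{}, dvMagic...), bitmap...)
	out := make([]byte, 0, len(body)+8)
	out = binary.BigEndian.AppendUint32(out, uint32(len(body)))
	out = append(out, body...)
	out = binary.BigEndian.AppendUint32(out, crc32.ChecksumIEEE(body))

	return out
}

// splitDVPayload (hypothetical helper) validates the framing and returns
// the serialized bitmap bytes without decoding them.
func splitDVPayload(p []byte) ([]byte, error) {
	if len(p) < 12 {
		return nil, errors.New("payload shorter than the 12 framing bytes")
	}
	n := binary.BigEndian.Uint32(p[:4])
	if uint64(n) != uint64(len(p)-8) {
		return nil, fmt.Errorf("length field %d does not match %d body bytes", n, len(p)-8)
	}
	body := p[4 : len(p)-4] // magic + bitmap
	if !bytes.Equal(body[:4], dvMagic) {
		return nil, errors.New("bad magic")
	}
	got, want := crc32.ChecksumIEEE(body), binary.BigEndian.Uint32(p[len(p)-4:])
	if got != want {
		return nil, fmt.Errorf("crc32 mismatch: computed %08x, stored %08x", got, want)
	}

	return body[4:], nil
}

func main() {
	// 38 placeholder bytes stand in for the serialized bitmap; they are
	// not a real roaring bitmap.
	framed := frameDVPayload(make([]byte, 38))
	bitmap, err := splitDVPayload(framed)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(framed), len(bitmap)) // 50 38
}
```

The 50-byte total matches the README arithmetic: 4 (length) + 4 (magic) + 38 (bitmap) + 4 (CRC32).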
diff --git a/puffin/testdata/deletion-vector-v1.puffin b/puffin/testdata/deletion-vector-v1.puffin
new file mode 100644
index 00000000..84c6baf7
Binary files /dev/null and b/puffin/testdata/deletion-vector-v1.puffin differ
