tanmayrauth commented on code in PR #1021:
URL: https://github.com/apache/iceberg-go/pull/1021#discussion_r3204693327
##########
table/snapshot_producers.go:
##########
@@ -733,7 +733,20 @@ func (sp *snapshotProducer) manifestProducer(content
iceberg.ManifestContent, fi
}
defer internal.CheckedClose(wr, &err)
+ // For v3 data manifests, assign first_row_id to each data file.
+ // Each file claims a contiguous range of row IDs starting from
NextRowID.
+ assignRowIDs := sp.txn.meta.formatVersion >= 3 && content ==
iceberg.ManifestContentData
+ nextRowID := sp.txn.meta.NextRowID()
Review Comment:
Yeah, this was a real correctness issue. The manifest-list rebuild on retry
intentionally treats data manifests as immutable, and baking in per-file IDs at
producer-build time breaks that invariant. On contention, stale IDs would get
rejected by validateAndUpdateRowLineage deterministically, making retries
useless.
##########
table/snapshot_producers.go:
##########
@@ -733,7 +733,20 @@ func (sp *snapshotProducer) manifestProducer(content
iceberg.ManifestContent, fi
}
defer internal.CheckedClose(wr, &err)
+ // For v3 data manifests, assign first_row_id to each data file.
+ // Each file claims a contiguous range of row IDs starting from
NextRowID.
+ assignRowIDs := sp.txn.meta.formatVersion >= 3 && content ==
iceberg.ManifestContentData
Review Comment:
You're correct, and I should have read the spec more carefully here. I went
back and re-read the "First Row ID Inheritance" section - ADDED files carry
null, manifest-list writer assigns the manifest-level value, readers derive
per-file from record_count sums. We already implement both halves of that
(list writer at manifest.go:1508, reader at manifest.go:785). So this was just
duplicating existing machinery while also creating a second source of truth
that would break Java replay.
##########
manifest.go:
##########
@@ -2187,6 +2187,19 @@ func (b *DataFileBuilder) FirstRowID(id int64)
*DataFileBuilder {
return b
}
+// SetDataFileFirstRowID sets the first_row_id on an existing DataFile.
+// This is used by the snapshot producer to assign row IDs at write time for
v3 tables.
+// Returns false if the DataFile implementation does not support mutation.
+func SetDataFileFirstRowID(df DataFile, id int64) bool {
Review Comment:
Yeah, you're right - this was a hack. Type-asserting to the unexported
concrete type and silently returning false is not a contract anyone should have
to reason about. I've removed it entirely in the latest push since the per-file
assignment approach is gone now.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]