Re: [PR] feat(table): project reserved row-lineage fields as null when the file lacks them [iceberg-go]

via GitHub Sun, 10 May 2026 13:27:31 -0700


laskoviymishka commented on code in PR #1045:
URL: https://github.com/apache/iceberg-go/pull/1045#discussion_r3215462598



##########
schema.go:
##########
@@ -470,13 +479,26 @@ func (s *Schema) Select(caseSensitive bool, names 
...string) (*Schema, error) {
                for _, n := range names {
                        id, ok := nameMap[strings.ToLower(n)]
                        if !ok {
+                               field, isMeta := isRowLineageMeta(n)
+                               if isMeta {
+                                       missingMetaFields = 
append(missingMetaFields, field)
+
+                                       continue
+                               }
+
                                return nil, fmt.Errorf("%w: could not find 
column %s", ErrInvalidSchema, n)
                        }
                        ids[id] = void
                }
        }
 
-       return PruneColumns(s, ids, true)
+       prunedSchema, err := PruneColumns(s, ids, true)
+
+       if len(missingMetaFields) == 0 {
+               return prunedSchema, err
+       }
+
+       return NewSchema(prunedSchema.ID, append(prunedSchema.fields, 
missingMetaFields...)...), nil

Review Comment:
   Several concerns concentrated on this final return:
   
   **Blocking — nil-deref.** When `PruneColumns` returns `(nil, err)` and 
`missingMetaFields` is non-empty, the early-return guard above is skipped and 
`prunedSchema.ID` / `prunedSchema.fields` panic. Need an `if err != nil { 
return nil, err }` before this line.
   
   **Drops `IdentifierFieldIDs`.** `NewSchema` calls 
`NewSchemaWithIdentifiers(id, []int{}, ...)`. `PruneColumns` filters 
identifiers correctly and the fast path returns them; this slow path discards 
them. Use `NewSchemaWithIdentifiers(prunedSchema.ID, 
prunedSchema.IdentifierFieldIDs, ...)`.
   
   **Latent panic on duplicate names.** If a caller passes `Select(true, 
"_row_id", "_row_id")`, `missingMetaFields` ends up with two `RowID()` entries 
(same field ID), and `checkDuplicateFieldIDs` inside `NewSchemaWithIdentifiers` 
panics. `Select`'s doc says it returns errors — would dedupe 
`missingMetaFields` (or detect duplicates earlier and return 
`ErrInvalidSchema`).
   
   **Aliasing.** `append(prunedSchema.fields, ...)` writes through the 
unexported slice. `prunedSchema` is fresh from `PruneColumns` so cap usually 
equals len, but if pruning ever pre-allocates extra cap, the append aliases. 
`slices.Clone(prunedSchema.fields)` makes the no-aliasing intent explicit.
   
   Suggested shape:
   
   ```go
   prunedSchema, err := PruneColumns(s, ids, true)
   if err != nil {
       return nil, err
   }
   if len(missingMetaFields) == 0 {
       return prunedSchema, nil
   }
   return NewSchemaWithIdentifiers(
       prunedSchema.ID,
       prunedSchema.IdentifierFieldIDs,
       append(slices.Clone(prunedSchema.fields), missingMetaFields...)...,
   ), nil
   ```



##########
metadata_columns.go:
##########
@@ -59,3 +61,15 @@ func LastUpdatedSequenceNumber() NestedField {
 func IsMetadataColumn(fieldID int) bool {
        return fieldID == RowIDFieldID || fieldID == 
LastUpdatedSequenceNumberFieldID
 }
+
+func isRowLineageMeta(name string) (NestedField, bool) {

Review Comment:
   Nit: `isRowLineageMeta` reads as a pure predicate but returns `(NestedField, 
bool)`. Conventional Go names this `rowLineageMetaField` or 
`findRowLineageMeta` — matches the `findX` `(T, bool)` convention and reads 
better at the call sites in `schema.go`.



##########
schema.go:
##########
@@ -448,6 +448,7 @@ var void = Void{}
 // An error is returned if a requested name cannot be found.
 func (s *Schema) Select(caseSensitive bool, names ...string) (*Schema, error) {

Review Comment:
   Architecturally, would consider keeping `Schema.Select` as a pure structural 
prune and moving the row-lineage synthesis up into `Scan.Projection()` (in 
`table/scanner.go`).
   
   Reasons:
   - Java keeps `TypeUtil.select` pure and joins the metadata columns in the 
scan builder (`BaseSparkScanBuilder.projectionWithMetadataColumns`); 
PyIceberg's `Schema.select` raises on missing names unconditionally and 
synthesizes meta columns at the scan layer. This PR diverges from both.
   - `Schema.Select` is called outside scan paths too (e.g. equality-delete 
writers). Teaching it about exactly two reserved names couples the schema 
utility to scan-layer concerns.
   - The version gate is naturally available in `Projection()` via 
`metadata.Version()`, which gives you the v3 check for free (see comment on the 
`isRowLineageMeta` call below).
   
   Not blocking the change, but worth considering before this surface gets used 
by more callers.
   
   Also: the doc comment two lines up still says "An error is returned if a 
requested name cannot be found." — that's no longer true for `_row_id` / 
`_last_updated_sequence_number`. Either move the logic or update the doc.



##########
metadata_columns.go:
##########
@@ -59,3 +61,15 @@ func LastUpdatedSequenceNumber() NestedField {
 func IsMetadataColumn(fieldID int) bool {
        return fieldID == RowIDFieldID || fieldID == 
LastUpdatedSequenceNumberFieldID
 }
+
+func isRowLineageMeta(name string) (NestedField, bool) {
+       name = strings.ToLower(name)

Review Comment:
   Unconditional `strings.ToLower` short-circuits the case-sensitive branch in 
`Schema.Select`. A caller passing `Select(true, "_Row_ID")` should hit 
`ErrInvalidSchema` like any other miscased column; today it silently succeeds 
and gets a synthetic null `_row_id` field projected.
   
   Would take a `caseSensitive bool` parameter and only fold when `false`:
   
   ```go
   func rowLineageMetaField(name string, caseSensitive bool) (NestedField, 
bool) {
       if !caseSensitive {
           name = strings.ToLower(name)
       }
       switch name {
       case RowIDColumnName:
           return RowID(), true
       case LastUpdatedSequenceNumberColumnName:
           return LastUpdatedSequenceNumber(), true
       }
       return NestedField{}, false
   }
   ```
   
   (The constants are already lowercase, so the case-sensitive branch can 
compare directly.)



##########
table/scanner_internal_test.go:
##########
@@ -236,6 +236,70 @@ func TestBuildPartitionEvaluatorWithInvalidSpecID(t 
*testing.T) {
        assert.ErrorContains(t, err, "id 999")
 }
 
+// TestProjectionV3PreLineageFile verifies that Projection() succeeds and 
returns
+// _row_id and _last_updated_sequence_number as nullable (all-null-capable) 
fields when
+// the table is v3 with next-row-id set but the data file predates row lineage 
(those
+// columns are absent from the schema).
+func TestProjectionV3PreLineageFile(t *testing.T) {

Review Comment:
   Test coverage is happy-path-only. Consider adding subtests for:
   
   1. **`caseSensitive: false`** — the case-insensitive branch in `Select` got 
new code and is untested.
   2. **v1/v2 negative case** — assert that `Select(_, "_row_id")` still 
returns `ErrInvalidSchema` on a v1/v2 schema (locks in the version gate once 
added).
   3. **Partial presence** — schema with `_row_id` already in it but 
`_last_updated_sequence_number` absent. This is also where the duplicate-ID 
panic in `NewSchemaWithIdentifiers` would surface if `_row_id` ends up in both 
`ids` (via the name map) and `missingMetaFields`.
   4. **Panic-path coverage** — a schema/ids combo that makes `PruneColumns` 
return `(nil, err)` while `missingMetaFields` is non-empty (the blocking 
nil-deref above).
   
   A direct `Schema.Select` unit test in the `iceberg` package 
(`schema_test.go`) would also give clearer attribution than the end-to-end 
`Projection()` test, especially if the synthesis later moves out of `Select`.



##########
table/scanner_internal_test.go:
##########
@@ -236,6 +236,70 @@ func TestBuildPartitionEvaluatorWithInvalidSpecID(t 
*testing.T) {
        assert.ErrorContains(t, err, "id 999")
 }
 
+// TestProjectionV3PreLineageFile verifies that Projection() succeeds and 
returns
+// _row_id and _last_updated_sequence_number as nullable (all-null-capable) 
fields when
+// the table is v3 with next-row-id set but the data file predates row lineage 
(those
+// columns are absent from the schema).
+func TestProjectionV3PreLineageFile(t *testing.T) {
+       schema := iceberg.NewSchema(
+               1,
+               iceberg.NestedField{ID: 1, Name: "id", Type: 
iceberg.PrimitiveTypes.Int64, Required: true},
+               iceberg.NestedField{ID: 2, Name: "payload", Type: 
iceberg.PrimitiveTypes.String, Required: false},
+       )
+
+       metadata, err := NewMetadata(
+               schema,
+               iceberg.UnpartitionedSpec,
+               UnsortedSortOrder,
+               "s3://test-bucket/test_table",
+               iceberg.Properties{"format-version": "3"},
+       )
+       require.NoError(t, err)
+       assert.Equal(t, 3, metadata.Version(), "sanity: must be v3")
+       assert.GreaterOrEqual(t, metadata.NextRowID(), int64(0), "sanity: 
next-row-id must be set")

Review Comment:
   Nit: `>= 0` accepts `0`, but the message says "next-row-id must be set." If 
the intent is just "non-negative," the message is fine reworded; if the intent 
is "actually initialized to a meaningful value," the assertion is too weak (a 
freshly-created v3 table can legitimately start at `0`).



##########
schema.go:
##########
@@ -457,6 +458,14 @@ func (s *Schema) Select(caseSensitive bool, names 
...string) (*Schema, error) {
                for _, n := range names {
                        id, ok := nameMap[n]
                        if !ok {
+                               // check if v3 row lineage metadata

Review Comment:
   Comment says v3 but there's no version check anywhere in this path. On a v1 
or v2 table, `Select(_, "_row_id")` previously returned `ErrInvalidSchema`; 
with this PR it silently succeeds and injects a synthetic null field — a 
contract change for older tables.
   
   Java gates this on `formatVersion >= 3` via `TableUtil.supportsRowLineage`, 
and Spark's scan builder only calls `addLineageIfRequired` when that check 
passes. Recommend either threading the format version into `Select` (ugly) or 
moving the synthesis into `Scan.Projection()` where `metadata.Version()` is in 
scope (clean — see other comment).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat(table): project reserved row-lineage fields as null when the file lacks them [iceberg-go]

Reply via email to