laskoviymishka commented on code in PR #1045:
URL: https://github.com/apache/iceberg-go/pull/1045#discussion_r3215462598
##########
schema.go:
##########
@@ -470,13 +479,26 @@ func (s *Schema) Select(caseSensitive bool, names
...string) (*Schema, error) {
for _, n := range names {
id, ok := nameMap[strings.ToLower(n)]
if !ok {
+ field, isMeta := isRowLineageMeta(n)
+ if isMeta {
+ missingMetaFields =
append(missingMetaFields, field)
+
+ continue
+ }
+
return nil, fmt.Errorf("%w: could not find
column %s", ErrInvalidSchema, n)
}
ids[id] = void
}
}
- return PruneColumns(s, ids, true)
+ prunedSchema, err := PruneColumns(s, ids, true)
+
+ if len(missingMetaFields) == 0 {
+ return prunedSchema, err
+ }
+
+ return NewSchema(prunedSchema.ID, append(prunedSchema.fields,
missingMetaFields...)...), nil
Review Comment:
Several concerns concentrated on this final return:
**Blocking — nil-deref.** When `PruneColumns` returns `(nil, err)` and
`missingMetaFields` is non-empty, the early-return guard above is skipped and
`prunedSchema.ID` / `prunedSchema.fields` panic. Need an `if err != nil {
return nil, err }` before this line.
**Drops `IdentifierFieldIDs`.** `NewSchema` calls
`NewSchemaWithIdentifiers(id, []int{}, ...)`. `PruneColumns` filters
identifiers correctly and the fast path returns them; this slow path discards
them. Use `NewSchemaWithIdentifiers(prunedSchema.ID,
prunedSchema.IdentifierFieldIDs, ...)`.
**Latent panic on duplicate names.** If a caller passes `Select(true,
"_row_id", "_row_id")`, `missingMetaFields` ends up with two `RowID()` entries
(same field ID), and `checkDuplicateFieldIDs` inside `NewSchemaWithIdentifiers`
panics. `Select`'s doc says it returns errors — would dedupe
`missingMetaFields` (or detect duplicates earlier and return
`ErrInvalidSchema`).
**Aliasing.** `append(prunedSchema.fields, ...)` writes through the
unexported slice. `prunedSchema` is fresh from `PruneColumns` so cap usually
equals len, but if pruning ever pre-allocates extra cap, the append aliases.
`slices.Clone(prunedSchema.fields)` makes the no-aliasing intent explicit.
Suggested shape:
```go
prunedSchema, err := PruneColumns(s, ids, true)
if err != nil {
return nil, err
}
if len(missingMetaFields) == 0 {
return prunedSchema, nil
}
return NewSchemaWithIdentifiers(
prunedSchema.ID,
prunedSchema.IdentifierFieldIDs,
append(slices.Clone(prunedSchema.fields), missingMetaFields...)...,
), nil
```
##########
metadata_columns.go:
##########
@@ -59,3 +61,15 @@ func LastUpdatedSequenceNumber() NestedField {
func IsMetadataColumn(fieldID int) bool {
return fieldID == RowIDFieldID || fieldID ==
LastUpdatedSequenceNumberFieldID
}
+
+func isRowLineageMeta(name string) (NestedField, bool) {
Review Comment:
Nit: `isRowLineageMeta` reads as a pure predicate but returns `(NestedField,
bool)`. Conventional Go names this `rowLineageMetaField` or
`findRowLineageMeta` — matches the `findX` `(T, bool)` convention and reads
better at the call sites in `schema.go`.
##########
schema.go:
##########
@@ -448,6 +448,7 @@ var void = Void{}
// An error is returned if a requested name cannot be found.
func (s *Schema) Select(caseSensitive bool, names ...string) (*Schema, error) {
Review Comment:
Architecturally, would consider keeping `Schema.Select` as a pure structural
prune and moving the row-lineage synthesis up into `Scan.Projection()` (in
`table/scanner.go`).
Reasons:
- Java keeps `TypeUtil.select` pure and joins the metadata columns in the
scan builder (`BaseSparkScanBuilder.projectionWithMetadataColumns`);
PyIceberg's `Schema.select` raises on missing names unconditionally and
synthesizes meta columns at the scan layer. This PR diverges from both.
- `Schema.Select` is called outside scan paths too (e.g. equality-delete
writers). Teaching it about exactly two reserved names couples the schema
utility to scan-layer concerns.
- The version gate is naturally available in `Projection()` via
`metadata.Version()`, which gives you the v3 check for free (see comment on the
`isRowLineageMeta` call below).
Not blocking the change, but worth considering before this surface gets used
by more callers.
Also: the doc comment two lines up still says "An error is returned if a
requested name cannot be found." — that's no longer true for `_row_id` /
`_last_updated_sequence_number`. Either move the logic or update the doc.
##########
metadata_columns.go:
##########
@@ -59,3 +61,15 @@ func LastUpdatedSequenceNumber() NestedField {
func IsMetadataColumn(fieldID int) bool {
return fieldID == RowIDFieldID || fieldID ==
LastUpdatedSequenceNumberFieldID
}
+
+func isRowLineageMeta(name string) (NestedField, bool) {
+ name = strings.ToLower(name)
Review Comment:
Unconditional `strings.ToLower` short-circuits the case-sensitive branch in
`Schema.Select`. A caller passing `Select(true, "_Row_ID")` should hit
`ErrInvalidSchema` like any other miscased column; today it silently succeeds
and gets a synthetic null `_row_id` field projected.
Would take a `caseSensitive bool` parameter and only fold when `false`:
```go
func rowLineageMetaField(name string, caseSensitive bool) (NestedField,
bool) {
if !caseSensitive {
name = strings.ToLower(name)
}
switch name {
case RowIDColumnName:
return RowID(), true
case LastUpdatedSequenceNumberColumnName:
return LastUpdatedSequenceNumber(), true
}
return NestedField{}, false
}
```
(The constants are already lowercase, so the case-sensitive branch can
compare directly.)
##########
table/scanner_internal_test.go:
##########
@@ -236,6 +236,70 @@ func TestBuildPartitionEvaluatorWithInvalidSpecID(t
*testing.T) {
assert.ErrorContains(t, err, "id 999")
}
+// TestProjectionV3PreLineageFile verifies that Projection() succeeds and
returns
+// _row_id and _last_updated_sequence_number as nullable (all-null-capable)
fields when
+// the table is v3 with next-row-id set but the data file predates row lineage
(those
+// columns are absent from the schema).
+func TestProjectionV3PreLineageFile(t *testing.T) {
Review Comment:
Test coverage is happy-path-only. Consider adding subtests for:
1. **`caseSensitive: false`** — the case-insensitive branch in `Select` got
new code and is untested.
2. **v1/v2 negative case** — assert that `Select(_, "_row_id")` still
returns `ErrInvalidSchema` on a v1/v2 schema (locks in the version gate once
added).
3. **Partial presence** — schema with `_row_id` already in it but
`_last_updated_sequence_number` absent. This is also where the duplicate-ID
panic in `NewSchemaWithIdentifiers` would surface if `_row_id` ends up in both
`ids` (via the name map) and `missingMetaFields`.
4. **Panic-path coverage** — a schema/ids combo that makes `PruneColumns`
return `(nil, err)` while `missingMetaFields` is non-empty (the blocking
nil-deref above).
A direct `Schema.Select` unit test in the `iceberg` package
(`schema_test.go`) would also give clearer attribution than the end-to-end
`Projection()` test, especially if the synthesis later moves out of `Select`.
##########
table/scanner_internal_test.go:
##########
@@ -236,6 +236,70 @@ func TestBuildPartitionEvaluatorWithInvalidSpecID(t
*testing.T) {
assert.ErrorContains(t, err, "id 999")
}
+// TestProjectionV3PreLineageFile verifies that Projection() succeeds and
returns
+// _row_id and _last_updated_sequence_number as nullable (all-null-capable)
fields when
+// the table is v3 with next-row-id set but the data file predates row lineage
(those
+// columns are absent from the schema).
+func TestProjectionV3PreLineageFile(t *testing.T) {
+ schema := iceberg.NewSchema(
+ 1,
+ iceberg.NestedField{ID: 1, Name: "id", Type:
iceberg.PrimitiveTypes.Int64, Required: true},
+ iceberg.NestedField{ID: 2, Name: "payload", Type:
iceberg.PrimitiveTypes.String, Required: false},
+ )
+
+ metadata, err := NewMetadata(
+ schema,
+ iceberg.UnpartitionedSpec,
+ UnsortedSortOrder,
+ "s3://test-bucket/test_table",
+ iceberg.Properties{"format-version": "3"},
+ )
+ require.NoError(t, err)
+ assert.Equal(t, 3, metadata.Version(), "sanity: must be v3")
+ assert.GreaterOrEqual(t, metadata.NextRowID(), int64(0), "sanity:
next-row-id must be set")
Review Comment:
Nit: `>= 0` accepts `0`, but the message says "next-row-id must be set." If
the intent is just "non-negative," the message is fine reworded; if the intent
is "actually initialized to a meaningful value," the assertion is too weak (a
freshly-created v3 table can legitimately start at `0`).
##########
schema.go:
##########
@@ -457,6 +458,14 @@ func (s *Schema) Select(caseSensitive bool, names
...string) (*Schema, error) {
for _, n := range names {
id, ok := nameMap[n]
if !ok {
+ // check if v3 row lineage metadata
Review Comment:
Comment says v3 but there's no version check anywhere in this path. On a v1
or v2 table, `Select(_, "_row_id")` previously returned `ErrInvalidSchema`;
with this PR it silently succeeds and injects a synthetic null field — a
contract change for older tables.
Java gates this on `formatVersion >= 3` via `TableUtil.supportsRowLineage`,
and Spark's scan builder only calls `addLineageIfRequired` when that check
passes. Recommend either threading the format version into `Select` (ugly) or
moving the synthesis into `Scan.Projection()` where `metadata.Version()` is in
scope (clean — see other comment).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]