rraulinio opened a new issue, #1205:
URL: https://github.com/apache/iceberg-go/issues/1205

   ### Apache Iceberg version
   
   main (development)
   
   ### Please describe the bug 🐞
   
   ### Apache Iceberg version
   
   main (development)
   
   ### Please describe the bug 🐞
   
   > Follow-up to #1190 / #1191 (requested in [this review 
comment](https://github.com/apache/iceberg-go/pull/1191#issuecomment-4694968108)).
 This is a **consistency / latent-footgun** fix, not a live wrong-results bug; 
#1191 already fixed the one path that miscompared. Filing separately because 
the change has real blast radius.
   
   ## Problem
   
   `PartitionSpec.PartitionType(schema)` **compacts** its result: a partition 
field whose source column is absent from `schema` (e.g. the column was dropped) 
is skipped rather than retained.
   
   This diverges from Java Iceberg, whose `PartitionSpec.partitionType()` 
**retains every field**, substituting `UnknownType` when the source column is 
gone:
   
   ```java
   // When the source field has been dropped we cannot determine the type
   if (sourceType == null) {
     resultType = Types.UnknownType.get();
   }
   ```
   
   Because manifest partition summaries (`ManifestFile.Partitions()` → 
`[]FieldSummary`) are **positional against the full partition spec**, the 
compacted struct cannot be used to index them. #1191 worked around this by 
adding a parallel `manifestPartitionFields` helper in `table/evaluators.go` 
that retains every field with an `UnknownType` placeholder, leaving the 
codebase with **two notions of "partition type"** (compacted vs. full). Any 
future positional consumer that reaches for `PartitionType()` will silently 
reintroduce the #1190 class of bug.
   
   ## Why This Matters
   
   - **Single source of truth.** One method should define a spec's partition 
struct; the `manifestPartitionFields` duplicate exists only to compensate for 
`PartitionType()`'s compaction.
   - **Reference parity.** Java retains dropped-source fields as `UnknownType`; 
matching it removes a subtle behavioral divergence.
   - **Footgun prevention.** The compaction is an implicit trap for positional 
callers, not a documented contract. (#1191 adds a doc note to `PartitionType()` 
warning against positional use as an interim guard.)
   
   ## Expected Behavior
   
   `PartitionType(schema)` retains **all** partition spec fields in order, 
using `UnknownType` for any field whose source column is missing from `schema`, 
matching Java. `manifestPartitionFields` is then deleted and 
`newManifestEvaluator` calls `PartitionType()` directly.
   
   ## Blast Radius (call sites to audit)
   
   `PartitionType()` (and the positional 
`newPartitionRecord`/`GetPartitionRecord` builder that keys values by `FieldID` 
over `FieldList`) is consumed by:
   
   **Write / encode paths** - in practice always invoked with a schema where 
every partition source is present, so compaction is a no-op today; each must 
nonetheless tolerate an `UnknownType` placeholder after the change:
   - `manifest.go`- partition `fieldStats` building, and 
`partitionTypeToAvroSchema(...)` (**avro mapping for `UnknownType` must be 
confirmed**)
   - `data_file_codec.go` - partition encoding
   - `partitions.go` - `PartitionToPath`
   - `table/snapshots.go` - `GetPartitionRecord` for produced data/delete files
   - `table/partitioned_fanout_writer.go`, 
`table/pos_delete_partitioned_fanout_writer.go`
   
   **Read / scan path** - internally consistent today (schema + record built 
from the *same* compacted list), but should keep working once placeholders 
appear:
   - `table/scanner.go` - `buildPartitionEvaluator` (`partSchema` + 
`GetPartitionRecord`); the projected partition filter never references a 
dropped source, so an unreferenced `UnknownType` placeholder must remain 
harmless to `ExpressionEvaluator` binding.
   
   ## Proposed Fix
   
   1. Change `PartitionType()` to emit an `UnknownType` placeholder (instead of 
`continue`) for fields whose source column is absent.
   2. Remove `manifestPartitionFields`; have `newManifestEvaluator` use 
`PartitionType()`.
   3. Audit the consumers above - primarily that avro partition-schema 
conversion and the data-file codec tolerate/guard `UnknownType`, and that 
evaluator binding ignores unreferenced placeholder fields.
   4. Tests: dropped-source partition field round-trips positionally through 
manifest evaluation (the #1191 regression test), plus avro/codec encode and 
scan-evaluator coverage with a dropped source present.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to