kevinjqliu opened a new issue, #16414:
URL: https://github.com/apache/iceberg/issues/16414

   This issue is ai assisted, and I have verified all the details.
   
   ## Summary
   
   The Iceberg spec says the `day` partition transform has result type `int`, 
where the value is the number of days from `1970-01-01`.
   
   A literal reading of the spec plus the Avro type mapping implies that 
manifest partition fields produced by `day(...)` should be encoded as plain 
Avro `int`:
   
   ```json
   "type": "int"
   ```
   
   However, several Iceberg implementations model `day(...)` partition fields 
as an Iceberg `date` / Avro logical `date`, and therefore write:
   
   ```json
   {
     "type": "int",
     "logicalType": "date"
   }
   ```
   
   Both forms have the same Avro physical encoding: an integer day count. The 
difference is the Avro schema logical type annotation. This can still affect 
interoperability because some readers validate manifest schemas strictly.
   
   This issue is to document the ambiguity and raise a discussion about the 
intended canonical behavior.
   
   ---
   
   ## Quick comparison
   
   | Implementation | Internal result type for `day(...)` | Manifest Avro 
schema written | Accepts plain Avro `int`? | Accepts Avro `int` + `logicalType: 
date`? |
   |---|---|---|---:|---:|
   | Spec literal interpretation | `int` | plain Avro `int` | N/A | N/A |
   | Java Iceberg, `apache/iceberg` | `DateType` | Avro logical `date` | Yes | 
Yes |
   | PyIceberg, `apache/iceberg-python` | `DateType` | Avro logical `date` | 
Generally yes in normal decode path; projection caveat | Yes |
   | Go, `apache/iceberg-go` current `main` | `Int32Type` | plain Avro `int` | 
Yes | Not clearly guaranteed on merged `main`; open PR proposes logical `date` |
   | Rust, `apache/iceberg-rust` current `main` | `Date` | Avro logical `date` 
| Not clearly guaranteed in normal manifest reader | Yes |
   
   ---
   
   ## Spec references and interpretation
   
   ### Partition transform result type
   
   The partition transform table says:
   
   > `day` Extract a date or timestamp day, as days from 1970-01-01 `date`, 
`timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` `int`
   
   Source: https://iceberg.apache.org/spec/#partition-transforms
   
   This explicitly lists the result type for `day` as `int`, not `date`.
   
   ### Manifest partition field schema
   
   The manifest section says:
   
   > A manifest is an immutable Avro file that lists data files or delete 
files, along with each file's partition data tuple, metrics, and tracking 
information.
   
   For the `partition` field in a manifest entry, the spec says:
   
   > Partition data tuple, schema based on the partition spec output using 
partition field ids for the struct field ids
   
   Source: https://iceberg.apache.org/spec/#manifests
   
   This means the Avro schema for the manifest `partition` struct is derived 
from the partition spec output. Therefore, the transform result type matters 
for the Avro schema used in the manifest.
   
   ### Avro type mapping
   
   Appendix A says:
   
   > Values should be stored in Avro using the Avro types and logical type 
annotations in the table below.
   
   The Avro mapping table distinguishes Iceberg `int` from Iceberg `date`:
   
   | Iceberg type | Avro representation | Note |
   |---|---|---|
   | `int` | `int` | |
   | `date` | `{ "type": "int", "logicalType": "date" }` | Stores days from 
1970-01-01. |
   
   Source: 
https://iceberg.apache.org/spec/#appendix-a-format-specific-requirements
   
   ### Literal spec interpretation
   
   Because `day(...)` has result type `int`, and Iceberg `int` maps to plain 
Avro `int`, the literal spec interpretation appears to be:
   
   ```json
   "type": "int"
   ```
   
   not:
   
   ```json
   {
     "type": "int",
     "logicalType": "date"
   }
   ```
   
   The spec uses the word `should` for the Avro mapping, so this may not be an 
absolute `MUST`-style requirement. But it is unclear whether the intended 
canonical manifest schema for `day(...)` partition fields is plain `int` or 
logical `date`.
   
   ---
   
   ## Implementation behavior
   
   ## 1. Java Iceberg: `apache/iceberg`
   
   ### What Java writes
   
   Java Iceberg models `day(...)` as `DateType`, not `IntegerType`.
   
   Relevant code paths:
   
   - `Dates.getResultType(...)`: 
https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/transforms/Dates.java
   - `Timestamps.getResultType(...)`: 
https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/transforms/Timestamps.java
   - `TypeToSchema.DATE_SCHEMA`: 
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/avro/TypeToSchema.java
   
   For day granularity, both date and timestamp transforms return 
`Types.DateType.get()`. Java's Avro conversion writes `DateType` as Avro `int` 
annotated with logical type `date`.
   
   So Java writes manifest partition fields for `day(...)` as:
   
   ```json
   {
     "type": "int",
     "logicalType": "date"
   }
   ```
   
   ### What Java accepts
   
   Java Iceberg accepts both forms.
   
   | Manifest Avro schema form | Accepted? | Behavior |
   |---|---:|---|
   | Plain Avro `int` | Yes | Reads as raw integer day count |
   | Avro `int` with `logicalType: date` | Yes | Reads as raw integer day count 
|
   
   Relevant reader path:
   
   - `InternalReader.primitive(...)`: 
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/avro/InternalReader.java
   
   The Java Avro reader maps Avro logical type `date` to `ValueReaders.ints()`. 
Plain Avro `INT` also maps to `ValueReaders.ints()`, except for the generic 
int-to-long widening case.
   
   ### Java summary
   
   Java writes the de facto implementation convention:
   
   ```json
   {
     "type": "int",
     "logicalType": "date"
   }
   ```
   
   but reads both the Java-style annotated form and the spec-literal plain 
`int` form.
   
   ---
   
   ## 2. PyIceberg: `apache/iceberg-python`
   
   ### What PyIceberg writes
   
   PyIceberg also models `day(...)` as `DateType`.
   
   Relevant code paths:
   
   - `DayTransform.result_type`: 
https://github.com/apache/iceberg-python/blob/main/pyiceberg/transforms.py
   - `PartitionSpec.partition_type(...)`: 
https://github.com/apache/iceberg-python/blob/main/pyiceberg/partitioning.py
   - Avro schema conversion: 
https://github.com/apache/iceberg-python/blob/main/pyiceberg/utils/schema_conversion.py
   
   The `DayTransform.result_type` implementation returns `DateType()`. The code 
comment explains the intent: the physical representation still conforms because 
`DateType` is internally converted to an int, while returning `DateType` gives 
a more human-readable display representation.
   
   PyIceberg's Avro conversion maps:
   
   | PyIceberg type | Avro schema |
   |---|---|
   | `IntegerType` | `"int"` |
   | `DateType` | `{ "type": "int", "logicalType": "date" }` |
   
   So PyIceberg writes manifest partition fields for `day(...)` as:
   
   ```json
   {
     "type": "int",
     "logicalType": "date"
   }
   ```
   
   ### What PyIceberg accepts
   
   PyIceberg recognizes both schema shapes at schema-conversion level.
   
   | Manifest Avro schema form | Converted PyIceberg type | Practical read 
behavior |
   |---|---|---|
   | Plain Avro `int` | `IntegerType` | Integer Avro encoding is readable in 
normal decode path |
   | Avro `int` with `logicalType: date` | `DateType` | Integer Avro encoding 
is readable as date logical type |
   
   Relevant reader path:
   
   - Avro reader implementation: 
https://github.com/apache/iceberg-python/blob/main/pyiceberg/avro/reader.py
   
   At the primitive reader level, `DateReader` subclasses `IntegerReader`, so 
both forms use the same integer Avro encoding.
   
   ### PyIceberg caveat
   
   PyIceberg is less explicitly tolerant than Java when a caller forces a 
projected/read schema expecting `DateType` while the file schema says plain 
`IntegerType`. The normal file-schema decode path can handle both 
integer-encoded forms, but explicit int-to-date projection compatibility is not 
as clear as Java's tolerant reader branch.
   
   ### PyIceberg summary
   
   PyIceberg writes the same de facto convention as Java:
   
   ```json
   {
     "type": "int",
     "logicalType": "date"
   }
   ```
   
   It can read both integer-encoded forms in the normal Avro decode path, with 
the projection caveat above.
   
   ---
   
   ## 3. Go: `apache/iceberg-go`
   
   ### What Go writes on current `main`
   
   Current `iceberg-go` appears to follow the literal spec interpretation.
   
   Relevant code paths:
   
   - `DayTransform.ResultType(...)`: 
https://github.com/apache/iceberg-go/blob/main/transforms.go
   - `PartitionSpec.PartitionType(...)`: 
https://github.com/apache/iceberg-go/blob/main/partitions.go
   - Avro schema conversion: 
https://github.com/apache/iceberg-go/blob/main/schema_conversions.go
   
   `DayTransform.ResultType(...)` returns `PrimitiveTypes.Int32`. 
`PartitionSpec.PartitionType(...)` uses the transform result type when 
constructing the partition struct. The Avro conversion maps `Int32Type` to Avro 
`IntNode`, while `DateType` maps to Avro `DateNode`.
   
   So current Go writes manifest partition fields for `day(...)` as:
   
   ```json
   "type": "int"
   ```
   
   ### What Go accepts
   
   | Manifest Avro schema form | Current `iceberg-go` status |
   |---|---|
   | Plain Avro `int` | Confirmed: this is what it writes and expects |
   | Avro `int` with `logicalType: date` | Not clearly guaranteed on merged 
`main` |
   
   The Go manifest reader opens the Avro OCF file, uses the file schema, and 
builds logical-type metadata from the actual Avro schema, including nested 
partition fields. However, I would not call Java-style "accepts both" behavior 
confirmed without a merged explicit compatibility test/path.
   
   Relevant reader path:
   
   - Manifest code: https://github.com/apache/iceberg-go/blob/main/manifest.go
   
   ### Open Go PR changing this behavior
   
   There is an open Go PR proposing to encode `day(...)` partition fields with 
Avro logical `date` for interoperability with engines that expect the 
Java-style manifest schema:
   
   - https://github.com/apache/iceberg-go/pull/915
   
   The PR description says current Go encoded all `Int32` partition fields as 
plain Avro `int`, and that this caused interoperability issues with engines 
expecting the logical `date` annotation.
   
   As of this writing, that PR is open.
   
   ### Go summary
   
   Current Go writes the spec-literal form:
   
   ```json
   "type": "int"
   ```
   
   The open PR proposes moving Go toward the Java/PyIceberg/Rust convention:
   
   ```json
   {
     "type": "int",
     "logicalType": "date"
   }
   ```
   
   ---
   
   ## 4. Rust: `apache/iceberg-rust`
   
   ### What Rust writes
   
   Current `iceberg-rust` follows the Java/PyIceberg convention.
   
   Relevant code paths:
   
   - `Transform::Day.result_type(...)`: 
https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/spec/transform.rs
   - Avro schema conversion: 
https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/avro/schema.rs
   - Manifest writer: 
https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/spec/manifest/writer.rs
   
   `Transform::Day.result_type(...)` maps timestamp/date source types to 
`PrimitiveType::Date`. Rust's Avro schema conversion maps `PrimitiveType::Date` 
to Avro date, while `PrimitiveType::Int` maps to plain Avro int. The manifest 
writer builds the manifest Avro schema from the partition type.
   
   So Rust writes manifest partition fields for `day(...)` as:
   
   ```json
   {
     "type": "int",
     "logicalType": "date"
   }
   ```
   
   ### Why Rust changed
   
   Rust originally mapped `Transform::Day` to `Int`. That caused manifest 
parsing failures for Java-written manifests whose day partition field used Avro 
logical `date`.
   
   Related issue and PR:
   
   - Issue: https://github.com/apache/iceberg-rust/issues/478
   - PR: https://github.com/apache/iceberg-rust/pull/479
   
   PR #479 changed `Transform::Day` to map partition types to `Date` rather 
than `Int` for consistency with the Java/reference implementation.
   
   ### What Rust accepts
   
   | Manifest Avro schema form | Current `iceberg-rust` status |
   |---|---|
   | Avro `int` with `logicalType: date` | Confirmed: this is what current Rust 
writes and expects |
   | Plain Avro `int` | Not clearly guaranteed in the normal manifest reader |
   
   The normal Rust manifest reader reconstructs the expected manifest schema 
from table metadata and the partition spec, then reads with that expected 
schema. The historical bug report shows that a mismatch between the expected 
schema and the manifest file schema caused parsing failure. Therefore, Rust 
should not be assumed to accept plain Avro `int` for `day(...)` partition 
fields unless an explicit compatibility path or test is added.
   
   Relevant reader path:
   
   - Manifest reader code: 
https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/spec/manifest/mod.rs
   
   ### Rust summary
   
   Rust writes the de facto implementation convention:
   
   ```json
   {
     "type": "int",
     "logicalType": "date"
   }
   ```
   
   Unlike Java, Rust should not currently be assumed to accept both schema 
forms in all manifest reading paths.
   
   ---
   
   ## Why this matters
   
   Although both schemas encode the same integer day count, the Avro schema 
annotation can matter for readers that construct an expected manifest schema 
and use strict Avro schema resolution.
   
   This has led to divergent implementation behavior:
   
   - Java/PyIceberg/Rust write logical `date`.
   - Go currently writes plain `int`.
   - Java accepts both forms.
   - Rust and Go may not accept both forms in all manifest reading paths.
   
   This creates uncertainty for new implementations and for existing 
implementations trying to decide whether to prioritize literal spec compliance 
or compatibility with the dominant Java/Spark convention.
   
   ---
   
   ## Questions for clarification
   
   Could the Iceberg spec clarify the intended canonical Avro schema for 
manifest partition fields produced by the `day(...)` transform?
   
   Specifically:
   
   1. Should the canonical transform result type for `day(...)` remain `int`, 
as currently documented?
   
   2. If the result type remains `int`, should manifest partition fields for 
`day(...)` be written as plain Avro `int`?
   
      ```json
      "type": "int"
      ```
   
   3. Or should `day(...)` be considered a special case where the logical 
result is date-like and writers should use Avro logical `date`?
   
      ```json
      {
        "type": "int",
        "logicalType": "date"
      }
      ```
   
   4. Should readers be required or strongly recommended to accept both schema 
forms for `day(...)` partition fields?
   
   5. If both forms are accepted, should one form be documented as canonical 
for writers?
   
   6. Should the partition transform result type table be changed from `int` to 
`date` for `day(...)`, or should the Avro manifest encoding guidance be changed 
to explicitly allow/use logical `date` for `day(...)` despite the transform 
table saying `int`?
   
   ---
   
   ## Possible resolutions
   
   ### Option A: Keep the spec-literal interpretation
   
   Clarify that `day(...)` result type is `int` and writers should use plain 
Avro `int` for manifest partition fields.
   
   Pros:
   
   - Matches the current partition transform table.
   - Matches Appendix A's mapping for Iceberg `int`.
   - Matches current `iceberg-go` behavior.
   
   Cons:
   
   - Diverges from current Java/PyIceberg/Rust behavior.
   - May require existing implementations to change writer behavior or document 
a compatibility exception.
   - Could cause compatibility problems with engines that expect logical `date`.
   
   ### Option B: Change or clarify `day(...)` result type to `date`
   
   Clarify that `day(...)` returns a date-like value and should be written as 
Avro logical `date`.
   
   Pros:
   
   - Matches Java/PyIceberg/Rust behavior.
   - Matches the de facto ecosystem convention.
   - Provides more human-readable metadata/table output.
   
   Cons:
   
   - Requires changing or qualifying the current transform result type table.
   - The transform table currently describes the result type as `int`.
   - Could make plain-`int` writers look non-compliant despite following the 
current spec text.
   
   ### Option C: Keep `int` result type but explicitly allow logical `date` in 
manifests
   
   Clarify that the physical value of `day(...)` is still an integer day count, 
but manifest Avro schemas may use either:
   
   ```json
   "type": "int"
   ```
   
   or:
   
   ```json
   {
     "type": "int",
     "logicalType": "date"
   }
   ```
   
   and readers should accept both.
   
   Pros:
   
   - Acknowledges the existing ecosystem split.
   - Avoids declaring current Java/PyIceberg/Rust or Go behavior invalid.
   - Encourages interoperability.
   
   Cons:
   
   - Leaves two writer encodings unless a canonical writer form is also chosen.
   - Requires readers to implement compatibility handling.
   
   ---
   
   ## Suggested clarification
   
   A possible clarification could be:
   
   > For partition fields produced by the `day` transform, the stored value is 
an integer number of days from `1970-01-01`. Manifest readers should accept 
both plain Avro `int` and Avro `int` annotated with `logicalType: date` for 
these fields. Writers should use [chosen canonical form].
   
   The main clarification needed is the final sentence: which form should 
writers use?
   
   ---
   
   ## Related issues and PRs
   
   - Java issue asking about this mismatch: 
https://github.com/apache/iceberg/issues/10616
   - Older Java issue discussing timestamp/day returning date for display: 
https://github.com/apache/iceberg/issues/279
   - Rust issue caused by Java date-logical-type manifests: 
https://github.com/apache/iceberg-rust/issues/478
   - Rust PR changing `Transform::Day` to `Date`: 
https://github.com/apache/iceberg-rust/pull/479
   - Go PR proposing logical `date` for day-transform partition fields: 
https://github.com/apache/iceberg-go/pull/915
   
   ---
   
   ## Requested outcome
   
   Please clarify the intended spec behavior for `day(...)` partition fields in 
manifest Avro schemas:
   
   - Is plain Avro `int` the canonical writer form?
   - Is Avro logical `date` the canonical writer form?
   - Are both forms allowed?
   - Should readers be required or strongly recommended to accept both?
   
   This would help align Java, PyIceberg, Go, Rust, and other implementations 
around one documented compatibility expectation.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to