kevinjqliu opened a new issue, #16414:
URL: https://github.com/apache/iceberg/issues/16414
This issue is ai assisted, and I have verified all the details.
## Summary
The Iceberg spec says the `day` partition transform has result type `int`,
where the value is the number of days from `1970-01-01`.
A literal reading of the spec plus the Avro type mapping implies that
manifest partition fields produced by `day(...)` should be encoded as plain
Avro `int`:
```json
"type": "int"
```
However, several Iceberg implementations model `day(...)` partition fields
as an Iceberg `date` / Avro logical `date`, and therefore write:
```json
{
"type": "int",
"logicalType": "date"
}
```
Both forms have the same Avro physical encoding: an integer day count. The
difference is the Avro schema logical type annotation. This can still affect
interoperability because some readers validate manifest schemas strictly.
This issue is to document the ambiguity and raise a discussion about the
intended canonical behavior.
---
## Quick comparison
| Implementation | Internal result type for `day(...)` | Manifest Avro
schema written | Accepts plain Avro `int`? | Accepts Avro `int` + `logicalType:
date`? |
|---|---|---|---:|---:|
| Spec literal interpretation | `int` | plain Avro `int` | N/A | N/A |
| Java Iceberg, `apache/iceberg` | `DateType` | Avro logical `date` | Yes |
Yes |
| PyIceberg, `apache/iceberg-python` | `DateType` | Avro logical `date` |
Generally yes in normal decode path; projection caveat | Yes |
| Go, `apache/iceberg-go` current `main` | `Int32Type` | plain Avro `int` |
Yes | Not clearly guaranteed on merged `main`; open PR proposes logical `date` |
| Rust, `apache/iceberg-rust` current `main` | `Date` | Avro logical `date`
| Not clearly guaranteed in normal manifest reader | Yes |
---
## Spec references and interpretation
### Partition transform result type
The partition transform table says:
> `day` Extract a date or timestamp day, as days from 1970-01-01 `date`,
`timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` `int`
Source: https://iceberg.apache.org/spec/#partition-transforms
This explicitly lists the result type for `day` as `int`, not `date`.
### Manifest partition field schema
The manifest section says:
> A manifest is an immutable Avro file that lists data files or delete
files, along with each file's partition data tuple, metrics, and tracking
information.
For the `partition` field in a manifest entry, the spec says:
> Partition data tuple, schema based on the partition spec output using
partition field ids for the struct field ids
Source: https://iceberg.apache.org/spec/#manifests
This means the Avro schema for the manifest `partition` struct is derived
from the partition spec output. Therefore, the transform result type matters
for the Avro schema used in the manifest.
### Avro type mapping
Appendix A says:
> Values should be stored in Avro using the Avro types and logical type
annotations in the table below.
The Avro mapping table distinguishes Iceberg `int` from Iceberg `date`:
| Iceberg type | Avro representation | Note |
|---|---|---|
| `int` | `int` | |
| `date` | `{ "type": "int", "logicalType": "date" }` | Stores days from
1970-01-01. |
Source:
https://iceberg.apache.org/spec/#appendix-a-format-specific-requirements
### Literal spec interpretation
Because `day(...)` has result type `int`, and Iceberg `int` maps to plain
Avro `int`, the literal spec interpretation appears to be:
```json
"type": "int"
```
not:
```json
{
"type": "int",
"logicalType": "date"
}
```
The spec uses the word `should` for the Avro mapping, so this may not be an
absolute `MUST`-style requirement. But it is unclear whether the intended
canonical manifest schema for `day(...)` partition fields is plain `int` or
logical `date`.
---
## Implementation behavior
## 1. Java Iceberg: `apache/iceberg`
### What Java writes
Java Iceberg models `day(...)` as `DateType`, not `IntegerType`.
Relevant code paths:
- `Dates.getResultType(...)`:
https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/transforms/Dates.java
- `Timestamps.getResultType(...)`:
https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/transforms/Timestamps.java
- `TypeToSchema.DATE_SCHEMA`:
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/avro/TypeToSchema.java
For day granularity, both date and timestamp transforms return
`Types.DateType.get()`. Java's Avro conversion writes `DateType` as Avro `int`
annotated with logical type `date`.
So Java writes manifest partition fields for `day(...)` as:
```json
{
"type": "int",
"logicalType": "date"
}
```
### What Java accepts
Java Iceberg accepts both forms.
| Manifest Avro schema form | Accepted? | Behavior |
|---|---:|---|
| Plain Avro `int` | Yes | Reads as raw integer day count |
| Avro `int` with `logicalType: date` | Yes | Reads as raw integer day count
|
Relevant reader path:
- `InternalReader.primitive(...)`:
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/avro/InternalReader.java
The Java Avro reader maps Avro logical type `date` to `ValueReaders.ints()`.
Plain Avro `INT` also maps to `ValueReaders.ints()`, except for the generic
int-to-long widening case.
### Java summary
Java writes the de facto implementation convention:
```json
{
"type": "int",
"logicalType": "date"
}
```
but reads both the Java-style annotated form and the spec-literal plain
`int` form.
---
## 2. PyIceberg: `apache/iceberg-python`
### What PyIceberg writes
PyIceberg also models `day(...)` as `DateType`.
Relevant code paths:
- `DayTransform.result_type`:
https://github.com/apache/iceberg-python/blob/main/pyiceberg/transforms.py
- `PartitionSpec.partition_type(...)`:
https://github.com/apache/iceberg-python/blob/main/pyiceberg/partitioning.py
- Avro schema conversion:
https://github.com/apache/iceberg-python/blob/main/pyiceberg/utils/schema_conversion.py
The `DayTransform.result_type` implementation returns `DateType()`. The code
comment explains the intent: the physical representation still conforms because
`DateType` is internally converted to an int, while returning `DateType` gives
a more human-readable display representation.
PyIceberg's Avro conversion maps:
| PyIceberg type | Avro schema |
|---|---|
| `IntegerType` | `"int"` |
| `DateType` | `{ "type": "int", "logicalType": "date" }` |
So PyIceberg writes manifest partition fields for `day(...)` as:
```json
{
"type": "int",
"logicalType": "date"
}
```
### What PyIceberg accepts
PyIceberg recognizes both schema shapes at schema-conversion level.
| Manifest Avro schema form | Converted PyIceberg type | Practical read
behavior |
|---|---|---|
| Plain Avro `int` | `IntegerType` | Integer Avro encoding is readable in
normal decode path |
| Avro `int` with `logicalType: date` | `DateType` | Integer Avro encoding
is readable as date logical type |
Relevant reader path:
- Avro reader implementation:
https://github.com/apache/iceberg-python/blob/main/pyiceberg/avro/reader.py
At the primitive reader level, `DateReader` subclasses `IntegerReader`, so
both forms use the same integer Avro encoding.
### PyIceberg caveat
PyIceberg is less explicitly tolerant than Java when a caller forces a
projected/read schema expecting `DateType` while the file schema says plain
`IntegerType`. The normal file-schema decode path can handle both
integer-encoded forms, but explicit int-to-date projection compatibility is not
as clear as Java's tolerant reader branch.
### PyIceberg summary
PyIceberg writes the same de facto convention as Java:
```json
{
"type": "int",
"logicalType": "date"
}
```
It can read both integer-encoded forms in the normal Avro decode path, with
the projection caveat above.
---
## 3. Go: `apache/iceberg-go`
### What Go writes on current `main`
Current `iceberg-go` appears to follow the literal spec interpretation.
Relevant code paths:
- `DayTransform.ResultType(...)`:
https://github.com/apache/iceberg-go/blob/main/transforms.go
- `PartitionSpec.PartitionType(...)`:
https://github.com/apache/iceberg-go/blob/main/partitions.go
- Avro schema conversion:
https://github.com/apache/iceberg-go/blob/main/schema_conversions.go
`DayTransform.ResultType(...)` returns `PrimitiveTypes.Int32`.
`PartitionSpec.PartitionType(...)` uses the transform result type when
constructing the partition struct. The Avro conversion maps `Int32Type` to Avro
`IntNode`, while `DateType` maps to Avro `DateNode`.
So current Go writes manifest partition fields for `day(...)` as:
```json
"type": "int"
```
### What Go accepts
| Manifest Avro schema form | Current `iceberg-go` status |
|---|---|
| Plain Avro `int` | Confirmed: this is what it writes and expects |
| Avro `int` with `logicalType: date` | Not clearly guaranteed on merged
`main` |
The Go manifest reader opens the Avro OCF file, uses the file schema, and
builds logical-type metadata from the actual Avro schema, including nested
partition fields. However, I would not call Java-style "accepts both" behavior
confirmed without a merged explicit compatibility test/path.
Relevant reader path:
- Manifest code: https://github.com/apache/iceberg-go/blob/main/manifest.go
### Open Go PR changing this behavior
There is an open Go PR proposing to encode `day(...)` partition fields with
Avro logical `date` for interoperability with engines that expect the
Java-style manifest schema:
- https://github.com/apache/iceberg-go/pull/915
The PR description says current Go encoded all `Int32` partition fields as
plain Avro `int`, and that this caused interoperability issues with engines
expecting the logical `date` annotation.
As of this writing, that PR is open.
### Go summary
Current Go writes the spec-literal form:
```json
"type": "int"
```
The open PR proposes moving Go toward the Java/PyIceberg/Rust convention:
```json
{
"type": "int",
"logicalType": "date"
}
```
---
## 4. Rust: `apache/iceberg-rust`
### What Rust writes
Current `iceberg-rust` follows the Java/PyIceberg convention.
Relevant code paths:
- `Transform::Day.result_type(...)`:
https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/spec/transform.rs
- Avro schema conversion:
https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/avro/schema.rs
- Manifest writer:
https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/spec/manifest/writer.rs
`Transform::Day.result_type(...)` maps timestamp/date source types to
`PrimitiveType::Date`. Rust's Avro schema conversion maps `PrimitiveType::Date`
to Avro date, while `PrimitiveType::Int` maps to plain Avro int. The manifest
writer builds the manifest Avro schema from the partition type.
So Rust writes manifest partition fields for `day(...)` as:
```json
{
"type": "int",
"logicalType": "date"
}
```
### Why Rust changed
Rust originally mapped `Transform::Day` to `Int`. That caused manifest
parsing failures for Java-written manifests whose day partition field used Avro
logical `date`.
Related issue and PR:
- Issue: https://github.com/apache/iceberg-rust/issues/478
- PR: https://github.com/apache/iceberg-rust/pull/479
PR #479 changed `Transform::Day` to map partition types to `Date` rather
than `Int` for consistency with the Java/reference implementation.
### What Rust accepts
| Manifest Avro schema form | Current `iceberg-rust` status |
|---|---|
| Avro `int` with `logicalType: date` | Confirmed: this is what current Rust
writes and expects |
| Plain Avro `int` | Not clearly guaranteed in the normal manifest reader |
The normal Rust manifest reader reconstructs the expected manifest schema
from table metadata and the partition spec, then reads with that expected
schema. The historical bug report shows that a mismatch between the expected
schema and the manifest file schema caused parsing failure. Therefore, Rust
should not be assumed to accept plain Avro `int` for `day(...)` partition
fields unless an explicit compatibility path or test is added.
Relevant reader path:
- Manifest reader code:
https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/spec/manifest/mod.rs
### Rust summary
Rust writes the de facto implementation convention:
```json
{
"type": "int",
"logicalType": "date"
}
```
Unlike Java, Rust should not currently be assumed to accept both schema
forms in all manifest reading paths.
---
## Why this matters
Although both schemas encode the same integer day count, the Avro schema
annotation can matter for readers that construct an expected manifest schema
and use strict Avro schema resolution.
This has led to divergent implementation behavior:
- Java/PyIceberg/Rust write logical `date`.
- Go currently writes plain `int`.
- Java accepts both forms.
- Rust and Go may not accept both forms in all manifest reading paths.
This creates uncertainty for new implementations and for existing
implementations trying to decide whether to prioritize literal spec compliance
or compatibility with the dominant Java/Spark convention.
---
## Questions for clarification
Could the Iceberg spec clarify the intended canonical Avro schema for
manifest partition fields produced by the `day(...)` transform?
Specifically:
1. Should the canonical transform result type for `day(...)` remain `int`,
as currently documented?
2. If the result type remains `int`, should manifest partition fields for
`day(...)` be written as plain Avro `int`?
```json
"type": "int"
```
3. Or should `day(...)` be considered a special case where the logical
result is date-like and writers should use Avro logical `date`?
```json
{
"type": "int",
"logicalType": "date"
}
```
4. Should readers be required or strongly recommended to accept both schema
forms for `day(...)` partition fields?
5. If both forms are accepted, should one form be documented as canonical
for writers?
6. Should the partition transform result type table be changed from `int` to
`date` for `day(...)`, or should the Avro manifest encoding guidance be changed
to explicitly allow/use logical `date` for `day(...)` despite the transform
table saying `int`?
---
## Possible resolutions
### Option A: Keep the spec-literal interpretation
Clarify that `day(...)` result type is `int` and writers should use plain
Avro `int` for manifest partition fields.
Pros:
- Matches the current partition transform table.
- Matches Appendix A's mapping for Iceberg `int`.
- Matches current `iceberg-go` behavior.
Cons:
- Diverges from current Java/PyIceberg/Rust behavior.
- May require existing implementations to change writer behavior or document
a compatibility exception.
- Could cause compatibility problems with engines that expect logical `date`.
### Option B: Change or clarify `day(...)` result type to `date`
Clarify that `day(...)` returns a date-like value and should be written as
Avro logical `date`.
Pros:
- Matches Java/PyIceberg/Rust behavior.
- Matches the de facto ecosystem convention.
- Provides more human-readable metadata/table output.
Cons:
- Requires changing or qualifying the current transform result type table.
- The transform table currently describes the result type as `int`.
- Could make plain-`int` writers look non-compliant despite following the
current spec text.
### Option C: Keep `int` result type but explicitly allow logical `date` in
manifests
Clarify that the physical value of `day(...)` is still an integer day count,
but manifest Avro schemas may use either:
```json
"type": "int"
```
or:
```json
{
"type": "int",
"logicalType": "date"
}
```
and readers should accept both.
Pros:
- Acknowledges the existing ecosystem split.
- Avoids declaring current Java/PyIceberg/Rust or Go behavior invalid.
- Encourages interoperability.
Cons:
- Leaves two writer encodings unless a canonical writer form is also chosen.
- Requires readers to implement compatibility handling.
---
## Suggested clarification
A possible clarification could be:
> For partition fields produced by the `day` transform, the stored value is
an integer number of days from `1970-01-01`. Manifest readers should accept
both plain Avro `int` and Avro `int` annotated with `logicalType: date` for
these fields. Writers should use [chosen canonical form].
The main clarification needed is the final sentence: which form should
writers use?
---
## Related issues and PRs
- Java issue asking about this mismatch:
https://github.com/apache/iceberg/issues/10616
- Older Java issue discussing timestamp/day returning date for display:
https://github.com/apache/iceberg/issues/279
- Rust issue caused by Java date-logical-type manifests:
https://github.com/apache/iceberg-rust/issues/478
- Rust PR changing `Transform::Day` to `Date`:
https://github.com/apache/iceberg-rust/pull/479
- Go PR proposing logical `date` for day-transform partition fields:
https://github.com/apache/iceberg-go/pull/915
---
## Requested outcome
Please clarify the intended spec behavior for `day(...)` partition fields in
manifest Avro schemas:
- Is plain Avro `int` the canonical writer form?
- Is Avro logical `date` the canonical writer form?
- Are both forms allowed?
- Should readers be required or strongly recommended to accept both?
This would help align Java, PyIceberg, Go, Rust, and other implementations
around one documented compatibility expectation.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]