laskoviymishka commented on code in PR #2366:
URL: https://github.com/apache/iceberg-rust/pull/2366#discussion_r3386557508
##########
crates/iceberg/src/spec/manifest/metadata.rs:
##########
@@ -80,13 +80,29 @@ impl ManifestMetadata {
"partition-spec is required in manifest metadata but
not found",
)
})?;
-
serde_json::from_slice::<Vec<PartitionField>>(bs).map_err(|err| {
- Error::new(
- ErrorKind::DataInvalid,
- "Fail to parse partition spec in manifest metadata",
- )
- .with_source(err)
- })?
+ // Accept either shape that Iceberg writers produce here:
+ // - bare JSON array `[{...}]` — historical iceberg-rust
shortcut
+ // - spec-compliant object `{"spec-id":N,"fields":[{...}]}` —
+ // iceberg-java / iceberg-cpp / pyiceberg
+ // Previously only the bare array was accepted, which made
+ // fast_append fail against tables where any non-rust writer
+ // had committed before (the parent-manifest load for the
+ // duplicate-file check tripped on the object shape).
+ serde_json::from_slice::<Vec<PartitionField>>(bs)
Review Comment:
The `or_else(|_| ...)` swallows the first error, so a genuinely malformed
bare array surfaces as the object-shape error ("expected field `fields`")
instead of the real problem. An untagged enum would give us a single-pass parse
and a coherent error, and drops the inline struct too:
```rust
#[derive(serde::Deserialize)]
#[serde(untagged)]
enum PartitionSpecRepr {
Array(Vec<PartitionField>),
Object { fields: Vec<PartitionField> },
}
```
(If we keep `spec-id` cross-checking from the note above, add it to the
`Object` variant.) wdyt?
##########
crates/iceberg/src/spec/manifest/metadata.rs:
##########
@@ -80,13 +80,29 @@ impl ManifestMetadata {
"partition-spec is required in manifest metadata but
not found",
)
})?;
-
serde_json::from_slice::<Vec<PartitionField>>(bs).map_err(|err| {
- Error::new(
- ErrorKind::DataInvalid,
- "Fail to parse partition spec in manifest metadata",
- )
- .with_source(err)
- })?
+ // Accept either shape that Iceberg writers produce here:
Review Comment:
This comment has the canonical shape inverted, and it'll mislead whoever
touches this next. The bare array is the spec-mandated, universal form — Java
and PyIceberg both write it — and the object shape is the non-standard one. So
this fallback is a defensive forward-compat hedge for non-conformant inputs,
not legacy support. I'd reword to say the array is canonical and the object is
the defensive case.
##########
crates/iceberg/src/spec/manifest/metadata.rs:
##########
@@ -80,13 +80,29 @@ impl ManifestMetadata {
"partition-spec is required in manifest metadata but
not found",
)
})?;
-
serde_json::from_slice::<Vec<PartitionField>>(bs).map_err(|err| {
- Error::new(
- ErrorKind::DataInvalid,
- "Fail to parse partition spec in manifest metadata",
- )
- .with_source(err)
- })?
+ // Accept either shape that Iceberg writers produce here:
+ // - bare JSON array `[{...}]` — historical iceberg-rust
shortcut
+ // - spec-compliant object `{"spec-id":N,"fields":[{...}]}` —
+ // iceberg-java / iceberg-cpp / pyiceberg
+ // Previously only the bare array was accepted, which made
+ // fast_append fail against tables where any non-rust writer
+ // had committed before (the parent-manifest load for the
+ // duplicate-file check tripped on the object shape).
+ serde_json::from_slice::<Vec<PartitionField>>(bs)
+ .or_else(|_| {
+ #[derive(serde::Deserialize)]
+ struct PartitionSpecJson {
+ fields: Vec<PartitionField>,
Review Comment:
When we parse the object form we drop its embedded `spec-id` on the floor,
so if it ever disagrees with the `partition-spec-id` key we'd silently pick one
and never notice. If the writer is reverted there's only one source of truth
and this is moot — but as long as the reader accepts the object, I'd at least
read `spec-id: Option<i32>` and return `DataInvalid` on a mismatch rather than
ignore it.
##########
crates/iceberg/src/spec/manifest/metadata.rs:
##########
@@ -80,13 +80,29 @@ impl ManifestMetadata {
"partition-spec is required in manifest metadata but
not found",
)
})?;
-
serde_json::from_slice::<Vec<PartitionField>>(bs).map_err(|err| {
- Error::new(
- ErrorKind::DataInvalid,
- "Fail to parse partition spec in manifest metadata",
- )
- .with_source(err)
- })?
+ // Accept either shape that Iceberg writers produce here:
+ // - bare JSON array `[{...}]` — historical iceberg-rust
shortcut
+ // - spec-compliant object `{"spec-id":N,"fields":[{...}]}` —
+ // iceberg-java / iceberg-cpp / pyiceberg
+ // Previously only the bare array was accepted, which made
Review Comment:
This block is quite a bit longer than the surrounding code's comment style —
once the wording is corrected, I'd trim it to a couple of lines. Same on the
writer side, where it's split across two `//` groups where one would do.
##########
crates/iceberg/src/spec/manifest/metadata.rs:
##########
@@ -80,13 +80,29 @@ impl ManifestMetadata {
"partition-spec is required in manifest metadata but
not found",
)
})?;
-
serde_json::from_slice::<Vec<PartitionField>>(bs).map_err(|err| {
- Error::new(
- ErrorKind::DataInvalid,
- "Fail to parse partition spec in manifest metadata",
- )
- .with_source(err)
- })?
+ // Accept either shape that Iceberg writers produce here:
+ // - bare JSON array `[{...}]` — historical iceberg-rust
shortcut
+ // - spec-compliant object `{"spec-id":N,"fields":[{...}]}` —
+ // iceberg-java / iceberg-cpp / pyiceberg
+ // Previously only the bare array was accepted, which made
+ // fast_append fail against tables where any non-rust writer
+ // had committed before (the parent-manifest load for the
+ // duplicate-file check tripped on the object shape).
+ serde_json::from_slice::<Vec<PartitionField>>(bs)
+ .or_else(|_| {
+ #[derive(serde::Deserialize)]
+ struct PartitionSpecJson {
+ fields: Vec<PartitionField>,
+ }
+
serde_json::from_slice::<PartitionSpecJson>(bs).map(|s| s.fields)
Review Comment:
There's no dedicated test for this object-shape read path — coverage is only
implicit via round-trip. I'd add a unit test that feeds object-shaped bytes to
the reader and asserts the fields parse correctly, plus one that passes a
bare-array payload through (the spec form). And once the writer direction is
settled, a test asserting the raw avro `partition-spec` bytes are a JSON array
would lock the format down. `test_parse_snappy_manifest_v2` (mod.rs:376) is
currently the only check of the array path and reads as an unlabeled canary —
worth a comment marking it as the spec-form test.
##########
crates/iceberg/src/spec/manifest/writer.rs:
##########
@@ -428,7 +428,12 @@ impl ManifestWriter {
)?;
avro_writer.add_user_metadata(
"partition-spec".to_string(),
- to_vec(&self.metadata.partition_spec.fields()).map_err(|err| {
+ // Serialize the full PartitionSpec object
({"spec-id":N,"fields":[...]})
+ // so iceberg-java / iceberg-cpp / pyiceberg can read manifests
+ // written by iceberg-rust. The previous output (bare `fields`
+ // array) is a rust-only shortcut that other Iceberg
+ // implementations reject.
+ to_vec(&self.metadata.partition_spec).map_err(|err| {
Review Comment:
I think the direction of this change is backwards, and it's the one thing
I'd want settled before merge.
The manifest user-metadata `partition-spec` key is spec'd as the fields
array only, not the full spec object. The two shapes are:
```json
// spec-mandated for this key (bare array)
[{"name":"x","transform":"identity","source-id":1,"field-id":1000}]
// what this PR now writes (object) — non-standard here
{"spec-id":0,"fields":[{"name":"x",...}]}
```
I checked this against current `main`: the spec says the `partition-spec`
manifest key is "JSON representation of only the partition fields array of the
partition spec used to write the manifest," with the `spec-id` carried
separately in the `partition-spec-id` key. Java's
`PartitionSpecParser.toJsonFields` writes the bare array, and its reader path
`buildFromJsonFields` does `Preconditions.checkArgument(json.isArray(), "Cannot
parse partition spec fields, not an array")` — so it throws on the object form.
PyIceberg's `manifest.py` likewise writes `to_json(self._spec.fields)`, a bare
array. That means this line would make rust-written manifests unreadable by
Java, which is the opposite of the goal. I'd revert it to
`to_vec(&self.metadata.partition_spec.fields())` and let the reader fallback be
the whole fix. wdyt?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]