laskoviymishka commented on code in PR #2366:
URL: https://github.com/apache/iceberg-rust/pull/2366#discussion_r3386557508


##########
crates/iceberg/src/spec/manifest/metadata.rs:
##########
@@ -80,13 +80,29 @@ impl ManifestMetadata {
                         "partition-spec is required in manifest metadata but 
not found",
                     )
                 })?;
-                
serde_json::from_slice::<Vec<PartitionField>>(bs).map_err(|err| {
-                    Error::new(
-                        ErrorKind::DataInvalid,
-                        "Fail to parse partition spec in manifest metadata",
-                    )
-                    .with_source(err)
-                })?
+                // Accept either shape that Iceberg writers produce here:
+                //   - bare JSON array `[{...}]` — historical iceberg-rust 
shortcut
+                //   - spec-compliant object `{"spec-id":N,"fields":[{...}]}` —
+                //     iceberg-java / iceberg-cpp / pyiceberg
+                // Previously only the bare array was accepted, which made
+                // fast_append fail against tables where any non-rust writer
+                // had committed before (the parent-manifest load for the
+                // duplicate-file check tripped on the object shape).
+                serde_json::from_slice::<Vec<PartitionField>>(bs)

Review Comment:
   The `or_else(|_| ...)` swallows the first error, so a genuinely malformed 
bare array surfaces as the object-shape error ("expected field `fields`") 
instead of the real problem. An untagged enum would give us a single-pass parse 
and a coherent error, and drops the inline struct too:
   
   ```rust
   #[derive(serde::Deserialize)]
   #[serde(untagged)]
   enum PartitionSpecRepr {
       Array(Vec<PartitionField>),
       Object { fields: Vec<PartitionField> },
   }
   ```
   
   (If we keep `spec-id` cross-checking from the note above, add it to the 
`Object` variant.) wdyt?



##########
crates/iceberg/src/spec/manifest/metadata.rs:
##########
@@ -80,13 +80,29 @@ impl ManifestMetadata {
                         "partition-spec is required in manifest metadata but 
not found",
                     )
                 })?;
-                
serde_json::from_slice::<Vec<PartitionField>>(bs).map_err(|err| {
-                    Error::new(
-                        ErrorKind::DataInvalid,
-                        "Fail to parse partition spec in manifest metadata",
-                    )
-                    .with_source(err)
-                })?
+                // Accept either shape that Iceberg writers produce here:

Review Comment:
   This comment has the canonical shape inverted, and it'll mislead whoever 
touches this next. The bare array is the spec-mandated, universal form — Java 
and PyIceberg both write it — and the object shape is the non-standard one. So 
this fallback is a defensive forward-compat hedge for non-conformant inputs, 
not legacy support. I'd reword to say the array is canonical and the object is 
the defensive case.



##########
crates/iceberg/src/spec/manifest/metadata.rs:
##########
@@ -80,13 +80,29 @@ impl ManifestMetadata {
                         "partition-spec is required in manifest metadata but 
not found",
                     )
                 })?;
-                
serde_json::from_slice::<Vec<PartitionField>>(bs).map_err(|err| {
-                    Error::new(
-                        ErrorKind::DataInvalid,
-                        "Fail to parse partition spec in manifest metadata",
-                    )
-                    .with_source(err)
-                })?
+                // Accept either shape that Iceberg writers produce here:
+                //   - bare JSON array `[{...}]` — historical iceberg-rust 
shortcut
+                //   - spec-compliant object `{"spec-id":N,"fields":[{...}]}` —
+                //     iceberg-java / iceberg-cpp / pyiceberg
+                // Previously only the bare array was accepted, which made
+                // fast_append fail against tables where any non-rust writer
+                // had committed before (the parent-manifest load for the
+                // duplicate-file check tripped on the object shape).
+                serde_json::from_slice::<Vec<PartitionField>>(bs)
+                    .or_else(|_| {
+                        #[derive(serde::Deserialize)]
+                        struct PartitionSpecJson {
+                            fields: Vec<PartitionField>,

Review Comment:
   When we parse the object form we drop its embedded `spec-id` on the floor, 
so if it ever disagrees with the `partition-spec-id` key we'd silently pick one 
and never notice. If the writer is reverted there's only one source of truth 
and this is moot — but as long as the reader accepts the object, I'd at least 
read `spec-id: Option<i32>` and return `DataInvalid` on a mismatch rather than 
ignore it.



##########
crates/iceberg/src/spec/manifest/metadata.rs:
##########
@@ -80,13 +80,29 @@ impl ManifestMetadata {
                         "partition-spec is required in manifest metadata but 
not found",
                     )
                 })?;
-                
serde_json::from_slice::<Vec<PartitionField>>(bs).map_err(|err| {
-                    Error::new(
-                        ErrorKind::DataInvalid,
-                        "Fail to parse partition spec in manifest metadata",
-                    )
-                    .with_source(err)
-                })?
+                // Accept either shape that Iceberg writers produce here:
+                //   - bare JSON array `[{...}]` — historical iceberg-rust 
shortcut
+                //   - spec-compliant object `{"spec-id":N,"fields":[{...}]}` —
+                //     iceberg-java / iceberg-cpp / pyiceberg
+                // Previously only the bare array was accepted, which made

Review Comment:
   This block is quite a bit longer than the surrounding code's comment style — 
once the wording is corrected, I'd trim it to a couple of lines. Same on the 
writer side, where it's split across two `//` groups where one would do.



##########
crates/iceberg/src/spec/manifest/metadata.rs:
##########
@@ -80,13 +80,29 @@ impl ManifestMetadata {
                         "partition-spec is required in manifest metadata but 
not found",
                     )
                 })?;
-                
serde_json::from_slice::<Vec<PartitionField>>(bs).map_err(|err| {
-                    Error::new(
-                        ErrorKind::DataInvalid,
-                        "Fail to parse partition spec in manifest metadata",
-                    )
-                    .with_source(err)
-                })?
+                // Accept either shape that Iceberg writers produce here:
+                //   - bare JSON array `[{...}]` — historical iceberg-rust 
shortcut
+                //   - spec-compliant object `{"spec-id":N,"fields":[{...}]}` —
+                //     iceberg-java / iceberg-cpp / pyiceberg
+                // Previously only the bare array was accepted, which made
+                // fast_append fail against tables where any non-rust writer
+                // had committed before (the parent-manifest load for the
+                // duplicate-file check tripped on the object shape).
+                serde_json::from_slice::<Vec<PartitionField>>(bs)
+                    .or_else(|_| {
+                        #[derive(serde::Deserialize)]
+                        struct PartitionSpecJson {
+                            fields: Vec<PartitionField>,
+                        }
+                        
serde_json::from_slice::<PartitionSpecJson>(bs).map(|s| s.fields)

Review Comment:
   There's no dedicated test for this object-shape read path — coverage is only 
implicit via round-trip. I'd add a unit test that feeds object-shaped bytes to 
the reader and asserts the fields parse correctly, plus one that passes a 
bare-array payload through (the spec form). And once the writer direction is 
settled, a test asserting the raw avro `partition-spec` bytes are a JSON array 
would lock the format down. `test_parse_snappy_manifest_v2` (mod.rs:376) is 
currently the only check of the array path and reads as an unlabeled canary — 
worth a comment marking it as the spec-form test.



##########
crates/iceberg/src/spec/manifest/writer.rs:
##########
@@ -428,7 +428,12 @@ impl ManifestWriter {
         )?;
         avro_writer.add_user_metadata(
             "partition-spec".to_string(),
-            to_vec(&self.metadata.partition_spec.fields()).map_err(|err| {
+            // Serialize the full PartitionSpec object 
({"spec-id":N,"fields":[...]})
+            // so iceberg-java / iceberg-cpp / pyiceberg can read manifests
+            // written by iceberg-rust. The previous output (bare `fields`
+            // array) is a rust-only shortcut that other Iceberg
+            // implementations reject.
+            to_vec(&self.metadata.partition_spec).map_err(|err| {

Review Comment:
   I think the direction of this change is backwards, and it's the one thing 
I'd want settled before merge.
   
   The manifest user-metadata `partition-spec` key is spec'd as the fields 
array only, not the full spec object. The two shapes are:
   
   ```json
   // spec-mandated for this key (bare array)
   [{"name":"x","transform":"identity","source-id":1,"field-id":1000}]
   
   // what this PR now writes (object) — non-standard here
   {"spec-id":0,"fields":[{"name":"x",...}]}
   ```
   
   I checked this against current `main`: the spec says the `partition-spec` 
manifest key is "JSON representation of only the partition fields array of the 
partition spec used to write the manifest," with the `spec-id` carried 
separately in the `partition-spec-id` key. Java's 
`PartitionSpecParser.toJsonFields` writes the bare array, and its reader path 
`buildFromJsonFields` does `Preconditions.checkArgument(json.isArray(), "Cannot 
parse partition spec fields, not an array")` — so it throws on the object form. 
PyIceberg's `manifest.py` likewise writes `to_json(self._spec.fields)`, a bare 
array. That means this line would make rust-written manifests unreadable by 
Java, which is the opposite of the goal. I'd revert it to 
`to_vec(&self.metadata.partition_spec.fields())` and let the reader fallback be 
the whole fix. wdyt?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to