wombatu-kun commented on issue #10635:
URL: https://github.com/apache/iceberg/issues/10635#issuecomment-4562070877

   Picking this up - I tried implementing the write path and want to share 
findings + a question on direction before doing more work, since the right 
answer depends on how strictly @nastra's earlier comment ("you can only 
represent it as a string when you write a UUID") should be read.
   
   ## What actually happens at runtime
   
   When a user writes a Spark `BinaryType` column into an Iceberg `UUID` column 
via the DataFrame API (`df.write.format("iceberg").save(...)`), the byte array 
does not reach `Spark*Writers$UUIDWriter` as a `byte[]`. Spark's V2 write 
planner first compares the input schema (`BinaryType`) against the catalog 
table schema (which exposes `UUID` as `StringType` via `TypeToSparkType`), and 
inserts a `cast(binary as string)` in the projection. That cast is a 
byte-for-byte UTF-8 reinterpretation of the 16 raw bytes, so the writer 
receives a `UTF8String` whose `getBytes()` is the original 16 UUID bytes. 
`UUID.fromString(s.toString())` then fails with `Invalid UUID string: 
<garbage>` - exactly the error in the bug report.
   
   There are also two other points worth flagging:
   
   1. `SparkFixupTypes.fixupPrimitive` only matches `STRING -> UUID` today. To 
even reach the writer for `BinaryType` input, the user's schema has to be 
accepted by `validateWriteSchema`, which today rejects it with "cannot be 
promoted to UUID" before any cast happens. So a `BINARY -> UUID` case in 
`SparkFixupTypes` is a prerequisite for any DataFrame-level fix.
   2. There is a separate programmatic path where a caller builds writers 
directly with a custom Spark `StructType` containing `BinaryType` for a UUID 
column (e.g. `SparkParquetWriters.buildWriter(structType, msgType)`). In that 
path the cast does not happen and `byte[]` does reach the writer - so a 
writer-level fix that only handles `UTF8String` would miss this case. This may 
or may not be in scope depending on how we read @nastra's comment.
   
   ## Options
   
   **(A) Make the existing `UUIDWriter` tolerant of the post-cast 16-byte 
`UTF8String`**, in addition to the canonical 36-char form. Concretely: in 
`Spark{Parquet,Value,Orc}Writers$UUIDWriter.write(UTF8String s)`, branch on 
`s.getBytes().length`: 16 -> write bytes directly; 36 -> `UUID.fromString` as 
today; otherwise throw a clear error. Also add `case BINARY: if (source == 
UUID) return true;` to `SparkFixupTypes` so the schema check accepts the user's 
`BinaryType` input. This is the smallest change and has no performance 
regression on the canonical path (one int compare). It does not touch 
`TypeToSparkType` so it does not change how Iceberg UUID columns are exposed to 
Spark - the column is still `StringType`, the user still passes data that Spark 
casts to a `UTF8String`. The trade-off is that it makes the writer accept some 
16-byte non-UUID `UTF8String` inputs as if they were raw UUID bytes (e.g. 
exactly-16-character text written into a UUID column would silently round-trip 
as a gar
 bage UUID). Today that same case throws `Invalid UUID string`.
   
   **(B) Change `TypeToSparkType: UUID -> BinaryType(16)`** (matching Flink) 
and make the read/write path consistent with that. This is the change 
@anuragmantri originally asked about and was the alternative @nastra appeared 
to rule out in the linked comment. It would also touch many existing tests that 
assume `StringType`.
   
   **(C) Don't fix; close as `wontfix` and document that users must convert 
`uuid.uuid4().bytes` to the canonical hex string client-side**.
   
   There is also a programmatic-API variant of (A) where the 
writer-construction visitors learn to dispatch to a `byte[]`-accepting 
`UUIDBytesWriter` when the caller passes `BinaryType` in the Spark schema - I 
prototyped this and it works for the programmatic path but adds non-trivial 
plumbing (a 3-arg `SparkOrcWriter` constructor + an `idToSparkType` map 
threaded through `WriteBuilder` / `createFieldGetter` / `ListWriter` / 
`MapWriter`). For the DataFrame API alone this plumbing is dead code, because 
Spark's cast happens upstream.
   
   ## Question
   
   @nastra @RussellSpitzer - is (A) acceptable as a writer-level tolerance fix 
(with the documented length-16 corner case), or do you want (B) / (C) instead? 
Happy to implement whichever direction you pick; just want to avoid opening a 
PR that goes against the intent of your earlier comment.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to