scovich commented on code in PR #523:
URL: https://github.com/apache/parquet-format/pull/523#discussion_r2420307237


##########
VariantShredding.md:
##########
@@ -65,12 +65,12 @@ The series of measurements `34, null, "n/a", 100` would be stored as:
 Both `value` and `typed_value` are optional fields used together to encode a single value.
 Values in the two fields must be interpreted according to the following table:
 
-| `value`  | `typed_value` | Meaning                                                     |
-|----------|---------------|-------------------------------------------------------------|
-| null     | null          | The value is missing; only valid for shredded object fields |
-| non-null | null          | The value is present and may be any type, including null    |
-| null     | non-null      | The value is present and is the shredded type               |
-| non-null | non-null      | The value is present and is a partially shredded object     |
+| `value`  | `typed_value` | Meaning                                                                                |
+|----------|---------------|----------------------------------------------------------------------------------------|
+| null     | null          | The value is missing; only valid for shredded object fields                            |
+| non-null | null          | The value is present and may be any type (except the shredded type), including null    |

Review Comment:
   > > Going a step further -- will (should?) the shredding spec take a stance on whether writers are required to perform widening and/or lossless narrowing conversions and/or lossy narrowing conversions of values whose types are in the same equivalence class as the typed_value column?
   > 
   > I thought this was discussed and the consensus was that this kind of normalization was an engine concern and did not belong in writers. Writers should faithfully write the bits given to them.
   > 
   > I need to re-read the spec, but I agree with @scovich that we should probably revert this change and bring it up for larger discussion (unless this is already stated someplace else in the spec).
   
   I remember the same discussion and the same outcome, but the spec doesn't actually say it, and the binary variant spec still carries this ambiguous "equivalence class" concept, which seems both under-defined and engine-centric (it deals in casting and equivalence semantics).
   
   In any case, if normalization decisions are up to the engine performing the writes, then the shredding spec shouldn't constrain the engine's choices of typing. The exceptions would be to forbid ambiguity and conflicts, which it generally does a decent job of already.
   
   So the immediate question would be: Is it ambiguous or conflicting for the `value` column to contain an Int32 value, and for `typed_value` to be NULL at that row, when `typed_value` is INT32? I think that's where we diverge from the seemingly-similar situation with partial shredding. There's no conflict -- only one value is available for that row -- and there's no ambiguity in what the value is.
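   To make that concrete, here is a minimal Python sketch of how a reader might interpret the four `value`/`typed_value` combinations from the table. All names and the modeling (Python `None` standing in for a SQL NULL column value) are hypothetical, not from the spec or any implementation; the point is that the second case accepts an Int32 in `value` even when `typed_value` is INT32, and only one value exists for the row:
   
   ```python
   # Hypothetical sketch: interpret one row of a shredded Variant column.
   # `value` models the binary-encoded Variant column, `typed_value` models
   # the shredded column; Python None stands in for a SQL NULL in either.
   MISSING = object()  # sentinel for "field is missing" (shredded object fields)
   
   def interpret_row(value, typed_value, is_object_field=False):
       if value is None and typed_value is None:
           # Missing; only valid for shredded object fields.
           if not is_object_field:
               raise ValueError("both columns NULL is only valid for object fields")
           return MISSING
       if value is not None and typed_value is None:
           # Present, any type (including Variant null). An Int32 here is
           # unambiguous even when typed_value's declared type is INT32:
           # only one value is available for this row.
           return value
       if value is None and typed_value is not None:
           # Present, and is the shredded type.
           return typed_value
       # Both non-null: partially shredded object; `value` holds the
       # leftover fields that were not shredded into typed_value.
       return ("partial_object", value, typed_value)
   ```
   
   Under this modeling, `interpret_row(34, None)` and `interpret_row(None, 34)` both yield a single unambiguous value, which is the crux of the argument above.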



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
