scovich commented on code in PR #523:
URL: https://github.com/apache/parquet-format/pull/523#discussion_r2414031848
##########
VariantShredding.md:
##########
@@ -65,12 +65,12 @@ The series of measurements `34, null, "n/a", 100` would be stored as:
Both `value` and `typed_value` are optional fields used together to encode a single value.
Values in the two fields must be interpreted according to the following table:
-| `value` | `typed_value` | Meaning |
-|----------|---------------|-------------------------------------------------------------|
-| null | null | The value is missing; only valid for shredded object fields |
-| non-null | null | The value is present and may be any type, including null |
-| null | non-null | The value is present and is the shredded type |
-| non-null | non-null | The value is present and is a partially shredded object |
+| `value` | `typed_value` | Meaning |
+|----------|---------------|----------------------------------------------------------------------------------------|
+| null | null | The value is missing; only valid for shredded object fields |
+| non-null | null | The value is present and may be any type (except the shredded type), including null |
Review Comment:
This is a material change in the spec -- today a NULL `typed_value` column
entry puts _no constraints whatsoever_ on the corresponding `value` column
entry (for primitive types, at least).
Don't get me wrong, it's very reasonable to require this, and adding the
requirement would make primitive shredding match the similar rule for partial
object shredding (a field in the `typed_value` column must never be present in
objects stored in the `value` column).
That said, I don't know if it makes much of a difference for readers
consuming numeric types, because somebody who sees an int32 `typed_value`
column probably still has to read the `value` column to pick up int8 values and
sufficiently narrow int64 values (and possibly even decimal values whose
fractional part is zero). And once they're doing that, it doesn't actually
change the algorithm at all if the `value` column can contain values that
"should" have shredded and did not. Depending on the reader's casting
semantics, they might need to pick up decimal and float columns as well. Maybe
even string columns (i.e. spark's casting semantics would attempt to parse
strings into integers).
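The reader algorithm described above can be sketched as follows. This is a hypothetical illustration (the function name and value representation are not from the spec): even with an int32 `typed_value` column, the reader still consults the `value` column to pick up other exact-numeric encodings such as int8 or sufficiently narrow int64 values.

```python
# Hypothetical sketch of a reader merging a shredded int32 `typed_value`
# column with its companion `value` column. Names are illustrative.

def read_as_int32(typed_value, value_entry):
    """Return the row's value as an int32, or None if it cannot be one."""
    if typed_value is not None:
        return typed_value                  # fast path: shredded as int32
    if value_entry is None:
        return None                         # no value in either column
    # Slow path: `value` may hold an int8, int16, or narrow int64 that
    # "should" have shredded; the algorithm is the same either way.
    if isinstance(value_entry, int) and -2**31 <= value_entry < 2**31:
        return value_entry
    return None                             # too wide, or not an integer
```

Whether the slow path must also attempt decimal, float, or even string conversions depends on the engine's casting semantics, as noted above.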
Going a step further -- will (should?) the shredding spec take a stance on
whether writers are required to perform widening and/or lossless narrowing
conversions and/or lossy narrowing conversions of values whose types are in the
same [equivalence
class](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types)
as the `typed_value` column?
* the variant spec places the scale 9 decimal4 value `0.035000000`, the int8
value `-5` and the int64 value `8589934592` all in the same "exact numeric"
variant equivalence class, which indicates "logical equivalence" for which
"user expressions... should behave the same."
* i.e. does the presence of an int32 `typed_value` column forbid the
existence of int8 entries in the `value` column? Does it forbid "narrow" int64
entries such as `10` that convert losslessly to int32? What about
decimal4(scale=5) values such as `15.0000` that allows a "lossless" (**)
conversion to int32 but requires an expensive integer `%` operation to detect?
What about decimal4(scale=5) values such as `1.23000` that could be rounded to
int32?
(**) I use scare quotes for "lossless" conversion of 1.00000 to 1 because
some would argue that the presence of those five trailing decimal zeros is
information that would be lost by conversion to int (the input had six
significant figures and the output has only one).
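The "expensive" lossless-narrowing test mentioned above can be made concrete with a short sketch. Deciding whether a decimal4 value such as 15.00000 (unscaled integer 1500000, scale 5) converts to int32 without loss requires a modulo check on the unscaled integer; the names below are illustrative, not from the spec.

```python
# Check whether a decimal (given as unscaled integer + scale, as in
# Parquet's decimal representation) narrows losslessly to int32.

def losslessly_narrows_to_int32(unscaled, scale):
    """True if unscaled * 10**(-scale) is an integer that fits in int32."""
    divisor = 10 ** scale
    if unscaled % divisor != 0:      # nonzero fractional part -> lossy
        return False
    quotient = unscaled // divisor
    return -2**31 <= quotient < 2**31
```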
##########
VariantShredding.md:
##########
@@ -214,9 +214,9 @@ The table below shows how the series of objects in the first column would be sto
| `{ }` | null | non-null | null | null | null | null | Object is present but empty |
| null | `00` (null) | null | | | | | Object/value is null |
| missing | null | null | | | | | Object/value is missing |
-| INVALID | `{"event_type": "login"}` | non-null | null | `login` | null | 1729795057774 | INVALID: Shredded field is present in `value` |
-| INVALID | `"a"` | non-null | null | null | null | null | INVALID: `typed_value` is present for non-object |
-| INVALID | `02 00` (object with 0 fields) | null | | | | | INVALID: `typed_value` is null for object |
+| INVALID CASE: `{"event_type": "login", "event_ts": 1729795057774}` | `{"event_type": "login"}` | non-null | null | `login` | null | 1729795057774 | INVALID: Shredded field is present in `value` |
+| INVALID CASE: `"a"` | `"a"` | non-null | null | null | null | null | INVALID: `typed_value` is present for non-object |
+| INVALID CASE: `{}` | `02 00` (object with 0 fields) | null | | | | | INVALID: `typed_value` is null for object |
Review Comment:
What about this case?
```
| `{"event_type": "login"}` | `02 00` (object with 0 fields) | non-null | null | `login` |
```
Is it legal for an object that shredded perfectly to still have an empty
object in the `value` column? It's sub-optimal but is it incorrect?
Conversely, what about this case?
```
| `{"event_type": "login"}` | null | non-null | null | `login` |
```
Is it allowed for an object that shredded perfectly to have NULL in the
`value` column? Or is it required to have an empty object there? (guessing it's
allowed to have NULL, and maybe even required to do so)
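The question can be made concrete with a reconstruction sketch (assumed semantics, not normative): if the reader merges the `value` object with the shredded fields, a NULL `value` and an empty `value` object reconstruct the same result, which is why the spec needs to say which encoding is legal.

```python
# Hypothetical reconstruction of an object row from a partially shredded
# `value` object plus its shredded fields.

def reconstruct_object(value_obj, shredded_fields):
    """value_obj: dict or None; shredded_fields: name -> value or None."""
    result = dict(value_obj or {})          # NULL behaves like `{}` here
    for name, field_value in shredded_fields.items():
        if field_value is not None:         # field present in typed_value
            result[name] = field_value
    return result
```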
##########
VariantShredding.md:
##########
@@ -214,9 +214,9 @@ The table below shows how the series of objects in the first column would be sto
| `{ }` | null | non-null | null | null | null | null | Object is present but empty |
| null | `00` (null) | null | | | | | Object/value is null |
| missing | null | null | | | | | Object/value is missing |
-| INVALID | `{"event_type": "login"}` | non-null | null | `login` | null | 1729795057774 | INVALID: Shredded field is present in `value` |
-| INVALID | `"a"` | non-null | null | null | null | null | INVALID: `typed_value` is present for non-object |
-| INVALID | `02 00` (object with 0 fields) | null | | | | | INVALID: `typed_value` is null for object |
+| INVALID CASE: `{"event_type": "login", "event_ts": 1729795057774}` | `{"event_type": "login"}` | non-null | null | `login` | null | 1729795057774 | INVALID: Shredded field is present in `value` |
+| INVALID CASE: `"a"` | `"a"` | non-null | null | null | null | null | INVALID: `typed_value` is present for non-object |
Review Comment:
This took a minute to decipher... it's illegal because NULL/NULL for the
`typed_value.event_type.*` columns means the `event_type` field is missing,
while the `value` column entry `"a"` claims this isn't an object in the first
place.
##########
VariantShredding.md:
##########
@@ -214,9 +214,9 @@ The table below shows how the series of objects in the first column would be sto
| `{ }` | null | non-null | null | null | null | null | Object is present but empty |
| null | `00` (null) | null | | | | | Object/value is null |
| missing | null | null | | | | | Object/value is missing |
-| INVALID | `{"event_type": "login"}` | non-null | null | `login` | null | 1729795057774 | INVALID: Shredded field is present in `value` |
-| INVALID | `"a"` | non-null | null | null | null | null | INVALID: `typed_value` is present for non-object |
-| INVALID | `02 00` (object with 0 fields) | null | | | | | INVALID: `typed_value` is null for object |
+| INVALID CASE: `{"event_type": "login", "event_ts": 1729795057774}` | `{"event_type": "login"}` | non-null | null | `login` | null | 1729795057774 | INVALID: Shredded field is present in `value` |
Review Comment:
This one is invalid because the `event_type` field is present in both
`value` and `typed_value` columns of a partially shredded object.
We're actually missing a second (similar) invalid row:
```
| INVALID CASE: `{"event_type": "login", "event_ts": 1729795057774}` | `{"event_type": "login"}` | null | |
```
(because the shredded `typed_value.event_type` column exists, an object stored
in `value` must never contain that field name)
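The rule in question can be sketched as a validity check (the function name is hypothetical, not from the spec): when a field is shredded into `typed_value`, an object stored in the companion `value` column must never contain that field name.

```python
# Illustrative validity check for a partially shredded object row.

def is_valid_partial_object(value_obj, shredded_field_names):
    """value_obj: dict or None; shredded_field_names: set of field names."""
    if value_obj is None:
        return True                         # nothing left in `value`
    return not any(name in value_obj for name in shredded_field_names)
```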
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]