qzyu999 opened a new issue, #839:
URL: https://github.com/apache/arrow-go/issues/839

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   # Summary:
   The `valuesize()` function in `parquet/variant/utils.go` checks `(typeinfo 
>> 4) & 0x1` to determine the `is_large` flag for both objects and arrays. 
While this is correct for objects, it is incorrect for arrays.
   
   According to the Parquet Variant spec, the array layout shifts the 
`is_large` flag to bit position 2 of the value header, rather than bit 4.
   
   # Root Cause Analysis
   The specification defines different header layouts to optimize space for 
objects vs. arrays:
   
   ## Object value_header (6 bits):
   ```text
   Bit Position:  [ 5 ]   [ 4 ]    [ 3   2 ]    [ 1   0 ]
   Data Stored:  Unused  is_large   field_id_sz    offset_sz
                            ▲
                   (Correctly checks Bit 4)
   ```
   ## Array value_header (6 bits):
   ```text
   Bit Position:  [ 5   4   3 ]    [ 2 ]    [ 1   0 ]
   Data Stored:      Unused       is_large   offset_sz
                                     ▲
                            (Should check Bit 2!)
   ```
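
To make the two layouts concrete, here is a minimal Go sketch that extracts `is_large` from the first byte of a value, following the diagrams above. The names `isLarge`, `basicTypeBits`, `basicObject`, and `basicArray` are hypothetical and chosen for this illustration; they are not taken from the arrow-go source. The sketch assumes the low 2 bits of the first byte hold the basic type (2 for objects, 3 for arrays) and the upper 6 bits hold the value header.

```go
package main

import "fmt"

// Hypothetical constants for this sketch; the real package defines its own.
const (
	basicTypeBits = 2 // the low 2 bits of the first byte hold basic_type
	basicObject   = 2
	basicArray    = 3
)

// isLarge extracts is_large from the first byte of an object or array value.
func isLarge(first byte) (bool, error) {
	valueHdr := first >> basicTypeBits // the 6-bit value_header
	switch first & 0b11 {
	case basicObject:
		return (valueHdr>>4)&0b1 == 1, nil // object: is_large is bit 4
	case basicArray:
		return (valueHdr>>2)&0b1 == 1, nil // array: is_large is bit 2
	default:
		return false, fmt.Errorf("basic type %d has no is_large flag", first&0b11)
	}
}

func main() {
	largeArray := byte(0b000100<<basicTypeBits | basicArray)   // only bit 2 set
	largeObject := byte(0b010000<<basicTypeBits | basicObject) // only bit 4 set
	a, _ := isLarge(largeArray)
	o, _ := isLarge(largeObject)
	fmt.Println(a, o) // true true
}
```

Checking bit 4 on `largeArray` would read an unused bit and report the array as not large.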
   # Evidence
   
   ## The Bug (parquet/variant/utils.go):
   ```go
      case byte(basicarray):
          var szbytes uint8 = 1
          if ((typeinfo >> 4) & 0x1) != 0 { // ❌ Error: Checks bit 4 instead of bit 2
              szbytes = 4
          }
   ```
   ## The Correct Implementation (parquet/variant/variant.go):
   ```go
      case basicarray:
          valuehdr := (v.value[0] >> basictypebits)
          fieldoffsetsz := (valuehdr & 0b11) + 1
          islarge := ((valuehdr >> 2) & 0b1) == 1 // Correct: Checks bit 2
   ```
      
   # Impact
   This causes `valuesize()` to return an incorrect size for large arrays 
(`is_large = true`), whose element count is stored in 4 bytes instead of 1. 
This leads directly to silent data corruption or panics during writes and 
compactions, specifically when `FinishObject()` compacts duplicate keys whose 
values happen to be large arrays.
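
The size error can be reproduced with a simplified model of the array size computation. This is a sketch under assumptions, not the actual `valuesize()` code: `arraySize`, `readLE`, and `buildLargeNullArray` are hypothetical helpers, and the layout assumed is one header byte, the element count (1 or 4 bytes), `n + 1` offsets, then the element data. The `flagBit` parameter selects which header bit is treated as `is_large`, so the same buffer can be measured both ways.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// readLE decodes a little-endian unsigned integer of 1 to 4 bytes.
func readLE(b []byte) int {
	n := 0
	for i, x := range b {
		n |= int(x) << (8 * i)
	}
	return n
}

// arraySize models the size computation; flagBit is the header bit
// treated as is_large (2 per the spec, 4 in the reported bug).
func arraySize(v []byte, flagBit uint) int {
	hdr := v[0] >> 2 // drop the 2 basic_type bits
	szBytes := 1
	if (hdr>>flagBit)&0x1 != 0 {
		szBytes = 4
	}
	n := readLE(v[1 : 1+szBytes])
	offSz := int(hdr&0b11) + 1
	lastOff := 1 + szBytes + n*offSz // position of the final offset entry
	return lastOff + offSz + readLE(v[lastOff:lastOff+offSz])
}

// buildLargeNullArray encodes n one-byte null elements as a large array
// with 2-byte offsets.
func buildLargeNullArray(n int) []byte {
	v := []byte{0b101<<2 | 3} // is_large=1 (bit 2), offset size code 1, basic type 3
	v = binary.LittleEndian.AppendUint32(v, uint32(n))
	for i := 0; i <= n; i++ {
		v = binary.LittleEndian.AppendUint16(v, uint16(i))
	}
	return append(v, make([]byte, n)...) // each null value is the byte 0x00
}

func main() {
	v := buildLargeNullArray(256)
	fmt.Println(len(v), arraySize(v, 2), arraySize(v, 4)) // 775 775 5
}
```

With bit 4, the model reads only the first byte of the 4-byte element count and reports 5 bytes for a 775-byte value, so any caller that copies or skips by that size would end up mid-value.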
   
   # Suggested Fix
   Update the `basicarray` case in `parquet/variant/utils.go` to shift by 2 
instead of 4:
   ```go
   case byte(basicarray):
        var szbytes uint8 = 1
        if ((typeinfo >> 2) & 0x1) != 0 { // Fix: Shift by 2 for arrays
                szbytes = 4
        }
   ```
   
   ### Component(s)
   
   Parquet

