jorisvandenbossche commented on issue #43908:
URL: https://github.com/apache/arrow/issues/43908#issuecomment-2326806045

   Comparing the two files (written with pyarrow 12.0.1 and with 14.0.2), the 
main difference I notice is the different name used for the list element. 
Comparing both Parquet schemas:
   
   ```
   In [9]: meta12.schema
   Out[9]: 
   <pyarrow._parquet.ParquetSchema object at 0x7f542c306800>
   required group field_id=-1 schema {
     optional group field_id=-1 sessions_array (List) {
       repeated group field_id=-1 list {
         optional group field_id=-1 item {
           optional fixed_len_byte_array(17) field_id=-1 item (Decimal(precision=40, scale=20));
         }
       }
     }
   }
   
   In [10]: meta14.schema
   Out[10]: 
   <pyarrow._parquet.ParquetSchema object at 0x7f542c4ac9c0>
   required group field_id=-1 schema {
     optional group field_id=-1 sessions_array (List) {
       repeated group field_id=-1 list {
         optional group field_id=-1 element {     # <------------------- this is "element" and not "item"
           optional fixed_len_byte_array(17) field_id=-1 item (Decimal(precision=40, scale=20));
         }
       }
     }
   }
   ```
   
   This was a deliberate change in pyarrow to follow the Parquet spec more 
closely 
(https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists), 
and can be controlled by passing `use_compliant_nested_type=True/False` when 
writing 
(https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html).
  
   While it is closer to the spec, maybe BigQuery handles the file differently 
because of that. @matteosdocsity could you try again with 
`use_compliant_nested_type=False` to see if that changes anything?
   
   In the metadata I also see that the listed encodings have a different order 
(pyarrow 12 puts RLE_DICTIONARY first, while pyarrow 14 puts PLAIN first). To 
help narrow down the issue, @matteosdocsity you could also try tweaking some 
parameters based on that (e.g. `use_dictionary=False`).

