Re: [PR] feat(parquet)!: coerce_types flag for date64 [arrow-rs]

via GitHub Fri, 20 Sep 2024 06:33:02 -0700


dsgibbons commented on PR #6313:
URL: https://github.com/apache/arrow-rs/pull/6313#issuecomment-2363746738


   Thank you for taking the time to look at this @etseidl. I'm still new to the 
project so I have plenty to learn.
   
   From #1938:
   
   > If not coerce_types, write as Int64 and embed logical type in arrow schema 
only. 
   
   I think I interpreted this as the Parquet `LogicalType`. I hadn't seen that 
[ref](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date)
 before.
   
   > I believe the approach called for in 
https://github.com/apache/arrow-rs/issues/1938 is to write un-annotated INT64, 
and rely on the encoded arrow schema to know how to interpret the column.
   
   So if we can't embed the fact that the field refers to a date in the Parquet 
`LogicalType`, do we provide additional type information during/after reading 
to interpret `INT64` columns as `Date64`? Is this what was meant by "embed 
logical type in arrow schema only" from #1938? 
   
   I thought that all type information was inferred from the Parquet file, 
which is why I removed the `INT32(DATE)->Date64` code: I didn't think there 
would be any way to know whether `INT32(DATE)` was coerced or not. Could you 
please give an example of how a reader would use an arrow schema to correctly 
interpret the columns?
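   
   To make the question concrete, here is a rough sketch of the behaviour I 
think is being described. It assumes the writer stores the Arrow schema in the 
file's key-value metadata (the `ARROW:schema` entry) and that the reader treats 
it as a hint layered on top of the Parquet types. Every name below 
(`ParquetColumn`, `ArrowType`, `infer_from_parquet`, `resolve`, 
`days_to_millis`) is a hypothetical stand-in for illustration, not the 
`parquet` crate's API, and the `INT32(DATE)` arm is my assumption about how a 
coerced column could be recovered:

```rust
// Illustrative model only: these types are hypothetical stand-ins,
// not the parquet crate's actual API.

/// A column as stored in the Parquet file: physical type plus optional annotation.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ParquetColumn {
    /// INT32 annotated with the DATE logical type (days since the epoch).
    Int32Date,
    /// INT64 with no logical type annotation.
    Int64Plain,
}

/// The Arrow types relevant to this discussion.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ArrowType {
    Date32,
    Date64,
    Int64,
}

/// What a reader can infer from the Parquet schema alone.
fn infer_from_parquet(col: ParquetColumn) -> ArrowType {
    match col {
        ParquetColumn::Int32Date => ArrowType::Date32,
        ParquetColumn::Int64Plain => ArrowType::Int64,
    }
}

/// Resolve the Arrow type, preferring a hint taken from the embedded Arrow
/// schema when it is compatible with what is stored in the file.
fn resolve(col: ParquetColumn, hint: Option<ArrowType>) -> ArrowType {
    match (col, hint) {
        // Un-annotated INT64 that the writer recorded as Date64:
        // the stored values are already milliseconds since the epoch.
        (ParquetColumn::Int64Plain, Some(ArrowType::Date64)) => ArrowType::Date64,
        // Assumption: INT32(DATE) that the writer recorded as Date64 was
        // coerced on write, so the reader converts days back to milliseconds.
        (ParquetColumn::Int32Date, Some(ArrowType::Date64)) => ArrowType::Date64,
        // No hint, or an incompatible one: fall back to the Parquet schema.
        _ => infer_from_parquet(col),
    }
}

/// Convert days since the epoch to milliseconds since the epoch.
fn days_to_millis(days: i32) -> i64 {
    i64::from(days) * 86_400_000
}
```

   If that is roughly right, then `resolve(ParquetColumn::Int64Plain, None)` 
would give a plain `Int64` column, while the same column read with a `Date64` 
hint would come back as `Date64`.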
   
   On another note, are you OK with the breaking change introduced by: 
`arrow_to_parquet_schema(schema: &Schema, coerce_types: bool)`?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
