etseidl commented on PR #6313: URL: https://github.com/apache/arrow-rs/pull/6313#issuecomment-2364062347
> Thank you for taking the time to look at this @etseidl. I'm still new to the project so I have plenty to learn. 👋 Welcome! I'm pretty new here too, and still learning as well 😄. > > I believe the approach called for in #1938 is to write un-annotated INT64, and rely on the encoded arrow schema to know how to interpret the column. > > So if we can't embed the fact that the field refers to a date in the Parquet `LogicalType`, do we provide additional type information during/after reading to interpret `INT64` columns as `Date64`? Is this what was meant by "embed logical type in arrow schema only" from #1938? Yes. The Parquet format allows for key-value pairs in the metadata. What the arrow writers will do for certain types that aren't representable in Parquet (most of them time based), is to embed a base64 encoded _arrow_ schema (which is serialized with flatbuffers) in one of these key/value pairs, with the key being `ARROW:schema`. Take a look at the code [here](https://github.com/apache/arrow-rs/blob/4683c200f457e7f95a22ed8577c48df8997eadaa/parquet/src/arrow/schema/mod.rs#L143), and trace backwards to see how it's used. My hope is that the arrow schema will preserve the fact that in arrow it was a `Date64` with no further changes (i.e. if `coerce==true` then do what is currently done, if `false` then don't scale the value and write as `INT64` with no Parquet LogicalType). On the read side, I'm hoping the arrow type will still be `Date64`, and then you just need to account for the case where the Parquet type is `INT64` (the `INT32` case is already there since coercing is the current behavior). Did that make sense? Hopefully someone with more arrow-side experience will correct me if I've messed up some details. > On another note, are you OK with the breaking change introduced by: `arrow_to_parquet_schema(schema: &Schema, coerce_types: bool)`? Breaking changes are ok, but with the caveat that they're only allowed in major releases (with the next 54.0.0 release slated for December). IIUC what will happen is once this is PR ready for merge, it will be changed to "Draft" until development on 54.0.0 commences. I'll take another look and see if there's a way to avoid breaking the API. Perhaps new functions that accept the `coerce` flag, and the existing ones call the new ones passing `true`. I don't know offhand how many types are impacted, and whether they all currently behave as if coerce is true, though. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
