etseidl commented on PR #6313:
URL: https://github.com/apache/arrow-rs/pull/6313#issuecomment-2364062347

   > Thank you for taking the time to look at this @etseidl. I'm still new to 
the project so I have plenty to learn.
   
   👋 Welcome! I'm pretty new here too, and still learning as well 😄.
   
   > > I believe the approach called for in #1938 is to write un-annotated 
INT64, and rely on the encoded arrow schema to know how to interpret the column.
   > 
   > So if we can't embed the fact that the field refers to a date in the 
Parquet `LogicalType`, do we provide additional type information during/after 
reading to interpret `INT64` columns as `Date64`? Is this what was meant by 
"embed logical type in arrow schema only" from #1938?
   
   Yes. The Parquet format allows for key-value pairs in the metadata. What the 
arrow writers will do for certain types that aren't representable in Parquet 
(most of them time based), is to embed a base64 encoded _arrow_ schema (which 
is serialized with flatbuffers) in one of these key/value pairs, with the key 
being `ARROW:schema`. Take a look at the code 
[here](https://github.com/apache/arrow-rs/blob/4683c200f457e7f95a22ed8577c48df8997eadaa/parquet/src/arrow/schema/mod.rs#L143),
 and trace backwards to see how it's used.
   
   My hope is that the arrow schema will preserve the fact that in arrow it was 
a `Date64` with no further changes (i.e. if `coerce==true` then do what is 
currently done, if `false` then don't scale the value and write as `INT64` with 
no Parquet LogicalType). On the read side, I'm hoping the arrow type will still 
be `Date64`, and then you just need to account for the case where the Parquet 
type is `INT64` (the `INT32` case is already there since coercing is the 
current behavior).
   
   Did that make sense? Hopefully someone with more arrow-side experience will 
correct me if I've messed up some details.
   
   > On another note, are you OK with the breaking change introduced by: 
`arrow_to_parquet_schema(schema: &Schema, coerce_types: bool)`?
   
   Breaking changes are ok, but with the caveat that they're only allowed in 
major releases (with the next 54.0.0 release slated for December). IIUC what 
will happen is once this is PR ready for merge, it will be changed to "Draft" 
until development on 54.0.0 commences. I'll take another look and see if 
there's a way to avoid breaking the API. Perhaps new functions that accept the 
`coerce` flag, and the existing ones call the new ones passing `true`. I don't 
know offhand how many types are impacted, and whether they all currently behave 
as if coerce is true, though.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to