jorisvandenbossche commented on issue #38171:
URL: https://github.com/apache/arrow/issues/38171#issuecomment-1755398679

   > The input parquet bytes data is created using pyarrow 12.0.0 with 
`datetime64[ns]` in the input df.
   > When convert bytes data back to pandas df the unit is still 
`datetime64[ns]` using pyarrow 12.0.0.
   > But the unit becomes `datetime64[us]` when using pyarrow 13.0.0.
   
   What is happening here is that with pyarrow < 13.0, the Parquet file itself always had `us` resolution (a limitation of the Parquet format version that was written by default). When reading it back in as a pyarrow.Table, you get a `timestamp[us]` column, but when converting to pandas, this was converted back to `ns` resolution, because that was the only resolution supported by pandas until recently.
   So while with older pyarrow versions it _appeared_ you had a correct roundtrip from ns -> parquet -> ns, in practice this always went through microseconds inside the Parquet file.
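   You can reproduce the old behaviour on a newer pyarrow by explicitly passing `version="2.4"` (the pre-13.0 default) to `write_table` — a minimal sketch; the example timestamp and buffer setup are just for illustration:
   
   ```python
   import io
   import pandas as pd
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   # A datetime64[ns] column in pandas (no sub-microsecond part, so the
   # cast to microseconds below is lossless and does not raise)
   df = pd.DataFrame({"ts": pd.to_datetime(["2023-01-01 00:00:00"])})
   
   # version="2.4" mimics the pre-13.0 default: ns timestamps are stored as us
   buf = io.BytesIO()
   pq.write_table(pa.Table.from_pandas(df), buf, version="2.4")
   buf.seek(0)
   table = pq.read_table(buf)
   print(table.schema.field("ts").type)  # timestamp[us] -- us inside the file
   ```
   
   (If the values had a sub-microsecond component, this cast would raise unless `allow_truncated_timestamps=True` is also passed.)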
   
   What changed in pyarrow 13.0 is two things:
   
   - We updated the Parquet format version that we write by default, so with the default settings pyarrow now writes nanoseconds (this can still be disabled with `version="2.4"`; in older pyarrow versions, you could already enable it by passing `version="2.6"`)
   - When converting to pandas, if pandas >= 2 is installed, we no longer convert every resolution to nanoseconds, but preserve the resolution of the Arrow data in the conversion to pandas
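   Both changes together can be seen in a single roundtrip on pyarrow >= 13 with pandas >= 2 — a minimal sketch, with an illustrative timestamp that has a sub-microsecond part so the preserved nanoseconds are visible:
   
   ```python
   import io
   import pandas as pd
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   df = pd.DataFrame({"ts": pd.to_datetime(["2023-01-01 00:00:00.123456789"])})
   
   # Default format version is "2.6" in pyarrow >= 13, so nanoseconds
   # survive inside the Parquet file
   buf = io.BytesIO()
   pq.write_table(pa.Table.from_pandas(df), buf)
   buf.seek(0)
   table = pq.read_table(buf)
   print(table.schema.field("ts").type)  # timestamp[ns]
   # With pandas >= 2, the resolution is preserved on conversion
   print(table.to_pandas()["ts"].dtype)  # datetime64[ns]
   ```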
   
   This should explain all the possible combinations of write/read with pyarrow 
12/13.

