jorisvandenbossche commented on issue #38171: URL: https://github.com/apache/arrow/issues/38171#issuecomment-1755398679
> The input parquet bytes data is created using pyarrow 12.0.0 with `datetime64[ns]` in the input df.
> When convert bytes data back to pandas df the unit is still `datetime64[ns]` using pyarrow 12.0.0.
> But the unit becomes `datetime64[us]` when using pyarrow 13.0.0.

What is happening here is that with pyarrow < 13.0, the Parquet file itself always had `us` resolution (a limitation of the Parquet format version being written by default). When reading it back in as a `pyarrow.Table`, you get a `timestamp[us]` column, but when converting to pandas, this was converted back to `ns` resolution, because that was the only resolution supported by pandas until recently. So while with older pyarrow versions it _appeared_ you had a correct roundtrip from ns -> parquet -> ns, in practice this always went through microseconds inside the Parquet file.

What changed in pyarrow 13.0 is two things:

- We updated the Parquet format version that we write by default, and now with the default settings pyarrow writes nanoseconds (this can still be disabled with `version="2.4"`; in older pyarrow, you could already enable it by passing `version="2.6"`).
- When converting to pandas, and if pandas >= 2 is installed, we no longer convert every resolution to nanoseconds, but preserve the resolution of the Arrow data in the conversion to pandas.

This should explain all the possible combinations of write/read with pyarrow 12/13.
