jorisvandenbossche commented on issue #38000: URL: https://github.com/apache/arrow/issues/38000#issuecomment-1746244230
@IkeNefcy the file you uploaded (https://github.com/apache/arrow/issues/38000#issuecomment-1745759195) was created with pyarrow 12.0 and is the file that works fine, is that correct? For that one, it is expected that the timestamps are stored as microseconds, and this should work almost anywhere.

As @mapleFU mentioned, with https://github.com/apache/arrow/pull/36137 we changed the default in pyarrow 13.0 to start writing nanoseconds (if your original data is in nanoseconds, which is the case when starting from pandas). I assume that the Parquet reader you are using with Spectrum is incorrectly reading those files. That's something best reported to them.

You can check the metadata of a Parquet file that was written using pyarrow as follows (shown here with the file you uploaded):

```
In [17]: import pyarrow.parquet as pq

In [18]: meta = pq.read_metadata("Downloads/test")

In [19]: meta
Out[19]:
<pyarrow._parquet.FileMetaData object at 0x7f4546eeae30>
  created_by: parquet-cpp-arrow version 12.0.0
  num_columns: 4
  num_rows: 1
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 2754

In [20]: meta.schema
Out[20]:
<pyarrow._parquet.ParquetSchema object at 0x7f4546e94500>
required group field_id=-1 schema {
  optional int64 field_id=-1 start_time_local (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int64 field_id=-1 end_time_local (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int64 field_id=-1 start_time_utc (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int64 field_id=-1 end_time_utc (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
}
```

(So you can see here that this file was created using pyarrow 12.0, and that the timestamps were written with microsecond resolution.)
