Re: [I] `to_timestamp()` wrong value reading from parquet [arrow-datafusion]

via GitHub Mon, 30 Oct 2023 10:30:55 -0700


alamb commented on issue #7958:
URL: https://github.com/apache/arrow-datafusion/issues/7958#issuecomment-1785720753


   > If so, I'd 100% be on board copying the behavior of these other well known 
databases
   
   I agree
   
   I took a look at what `ts.snappy.parquet` contains:
   ```
   $ parquet-tools schema -d ts.snappy.parquet
   message spark_schema {
     required int96 a;
   }
   
   creator: parquet-mr version 1.10.99.7.1.7.2000-305 (build eeabcd207c4c506ebd915865772cadb9bac25837)
   extra: org.apache.spark.version = 2.4.7
   extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"a","type":"timestamp","nullable":false,"metadata":{}}]}
   
   file schema: spark_schema
   
--------------------------------------------------------------------------------
   a: REQUIRED INT96 R:0 D:0
   
   row group 1: RC:1 TS:44 OFFSET:4
   
--------------------------------------------------------------------------------
   a:  INT96 SNAPPY DO:0 FPO:4 SZ:48/44/0.92 VC:1 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0x0000000000000000C4441A00, max: 0x0000000000000000C4441A00, num_nulls: 0]
   ```
   
   It seems to use a different type and has extra metadata that is not present 
in an equivalent file created by datafusion:
   
   ```
   ❯ select to_timestamp_seconds(-62125747200);
   +-------------------------------------------+
   | to_timestamp_seconds(Int64(-62125747200)) |
   +-------------------------------------------+
   | 0001-04-25T00:00:00                       |
   +-------------------------------------------+
   1 row in set. Query took 0.001 seconds.
   
   ❯ copy (select to_timestamp_seconds(-62125747200) as "a") to 'ts-df.parquet';
   +-------+
   | count |
   +-------+
   | 1     |
   +-------+
   1 row in set. Query took 0.024 seconds.
   ```
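
As a cross-check, the value printed by `to_timestamp_seconds` above can be reproduced outside DataFusion. This is only a minimal sketch, assuming the proleptic Gregorian calendar that Python's `datetime` uses; the helper name `seconds_to_naive_timestamp` is hypothetical and not part of any library:

```python
from datetime import datetime, timedelta


def seconds_to_naive_timestamp(seconds: int) -> datetime:
    """Convert seconds since the Unix epoch to a timezone-naive datetime.

    Adding a timedelta to the epoch is used instead of
    datetime.fromtimestamp(), which can reject far-past values
    on some platforms.
    """
    return datetime(1970, 1, 1) + timedelta(seconds=seconds)


# The literal used in the queries in this comment.
print(seconds_to_naive_timestamp(-62125747200).isoformat())
# prints 0001-04-25T00:00:00
```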
   
   The field is not read as a timestamp at all 🤔 
   
   
   
   ```
   ❯ select * from 'ts-df.parquet';
   +--------------+
   | a            |
   +--------------+
   | -62125747200 |
   +--------------+
   1 row in set. Query took 0.004 seconds.
   ```
   
   And the metadata / type information is different from Spark's:
   
   ```shell
   $ parquet-tools schema -d ts-df.parquet
   message arrow_schema {
     required int64 a;
   }
   
   creator: datafusion version 32.0.0
   
   file schema: arrow_schema
   
--------------------------------------------------------------------------------
   a: REQUIRED INT64 R:0 D:0
   
   row group 1: RC:1 TS:63 OFFSET:4
   
--------------------------------------------------------------------------------
   a:  INT64 ZSTD DO:4 FPO:35 SZ:81/63/0.78 VC:1 ENC:RLE,PLAIN,RLE_DICTIONARY ST:[min: -62125747200, max: -62125747200, num_nulls not defined]
   ```
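
To relate the two files, the INT96 statistic from the Spark file (`0x0000000000000000C4441A00`) can be decoded by hand. The sketch below assumes the conventional INT96 timestamp layout (8 bytes of nanoseconds within the day, then 4 bytes of Julian day number, both little-endian) and assumes that `parquet-tools` prints the 12 bytes in stored order. The function name `int96_to_epoch_seconds` is hypothetical. Under those assumptions the Spark value decodes to the same `-62125747200` seconds that datafusion wrote as a plain INT64:

```python
import struct

# Julian day number of 1970-01-01 (the Unix epoch).
JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588
SECONDS_PER_DAY = 86_400
NANOS_PER_SECOND = 1_000_000_000


def int96_to_epoch_seconds(raw: bytes) -> int:
    """Decode a 12-byte INT96 timestamp into whole seconds since the Unix epoch.

    Assumed layout: 8-byte little-endian nanoseconds-of-day followed by a
    4-byte little-endian Julian day number.
    """
    if len(raw) != 12:
        raise ValueError("an INT96 value must be exactly 12 bytes")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return days_since_epoch * SECONDS_PER_DAY + nanos_of_day // NANOS_PER_SECOND


# min/max statistic reported for column `a` in ts.snappy.parquet
raw = bytes.fromhex("0000000000000000C4441A00")
print(int96_to_epoch_seconds(raw))
# prints -62125747200
```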


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
