alamb commented on issue #7958:
URL:
https://github.com/apache/arrow-datafusion/issues/7958#issuecomment-1785720753
> If so, I'd 100% be on board copying the behavior of these other well known databases
I agree
I took a look at what `ts.snappy.parquet` contains:
```
$ parquet-tools schema -d ts.snappy.parquet
message spark_schema {
required int96 a;
}
creator: parquet-mr version 1.10.99.7.1.7.2000-305 (build eeabcd207c4c506ebd915865772cadb9bac25837)
extra: org.apache.spark.version = 2.4.7
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"a","type":"timestamp","nullable":false,"metadata":{}}]}
file schema: spark_schema
--------------------------------------------------------------------------------
a: REQUIRED INT96 R:0 D:0
row group 1: RC:1 TS:44 OFFSET:4
--------------------------------------------------------------------------------
a: INT96 SNAPPY DO:0 FPO:4 SZ:48/44/0.92 VC:1 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0x0000000000000000C4441A00, max: 0x0000000000000000C4441A00, num_nulls: 0]
```
It seems to use a different type (`int96`) and has extra metadata that is not
present in an equivalent file created by DataFusion:
```
❯ select to_timestamp_seconds(-62125747200);
+-------------------------------------------+
| to_timestamp_seconds(Int64(-62125747200)) |
+-------------------------------------------+
| 0001-04-25T00:00:00 |
+-------------------------------------------+
1 row in set. Query took 0.001 seconds.
❯ copy (select to_timestamp_seconds(-62125747200) as "a") to 'ts-df.parquet';
+-------+
| count |
+-------+
| 1 |
+-------+
1 row in set. Query took 0.024 seconds.
```
The field is not read as a timestamp at all 🤔
```
❯ select * from 'ts-df.parquet';
+--------------+
| a |
+--------------+
| -62125747200 |
+--------------+
1 row in set. Query took 0.004 seconds.
```
And the metadata / type information is different from Spark's:
```shell
$ parquet-tools schema -d ts-df.parquet
message arrow_schema {
required int64 a;
}
creator: datafusion version 32.0.0
file schema: arrow_schema
--------------------------------------------------------------------------------
a: REQUIRED INT64 R:0 D:0
row group 1: RC:1 TS:63 OFFSET:4
--------------------------------------------------------------------------------
a: INT64 ZSTD DO:4 FPO:35 SZ:81/63/0.78 VC:1 ENC:RLE,PLAIN,RLE_DICTIONARY ST:[min: -62125747200, max: -62125747200, num_nulls not defined]
```