houqp opened a new issue #455:
URL: https://github.com/apache/arrow-rs/issues/455
**Describe the bug**
Here is a pandas dataframe with a nanosecond timestamp `Date` index:
```
>>> hist.index
DatetimeIndex(['1986-03-13', '1986-03-14', '1986-03-17', '1986-03-18',
'1986-03-19', '1986-03-20', '1986-03-21', '1986-03-24',
'1986-03-25', '1986-03-26',
...
'2021-05-28', '2021-06-01', '2021-06-02', '2021-06-03',
'2021-06-04', '2021-06-07', '2021-06-08', '2021-06-09',
'2021-06-10', '2021-06-11'],
dtype='datetime64[ns]', name='Date', length=8885, freq=None)
```
When this dataframe is stored in parquet 1.0 format, pyarrow stores the Date
column in microsecond units. pyarrow also loads the Date column back with
microsecond precision:
```
>>> from pyarrow.parquet import ParquetFile
>>> pp = ParquetFile("test_data/msft.parquet")
>>> pp.metadata.schema
<pyarrow._parquet.ParquetSchema object at 0x7f720d1bbac0>
required group field_id=0 schema {
  optional double field_id=1 Open;
  optional double field_id=2 High;
  optional double field_id=3 Low;
  optional double field_id=4 Close;
  optional int64 field_id=5 Volume;
  optional double field_id=6 Dividends;
  optional double field_id=7 StockSplits;
  optional int64 field_id=8 Date (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
}
```
But when the file is loaded using the arrow parquet crate, the Date column is
incorrectly loaded as a nanosecond timestamp type.
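
To illustrate why the unit matters, here is a minimal sketch in plain Python. It assumes (this is not confirmed by the report) that a reader labels the stored int64 values as nanoseconds without rescaling them. `decode_timestamp` is a hypothetical helper, not part of pyarrow or the arrow crate.

```python
from datetime import datetime, timedelta

# Naive epoch, matching the timezone-less Date column (isAdjustedToUTC=false).
EPOCH = datetime(1970, 1, 1)

def decode_timestamp(raw: int, unit: str) -> datetime:
    """Hypothetical helper: read a raw int64 as a count of `unit` since the epoch."""
    ticks_per_second = {"us": 1_000_000, "ns": 1_000_000_000}[unit]
    seconds, remainder = divmod(raw, ticks_per_second)
    # timedelta has microsecond resolution, so sub-microsecond ticks are truncated.
    micros = remainder * 1_000_000 // ticks_per_second
    return EPOCH + timedelta(seconds=seconds, microseconds=micros)

# First value of the Date index (1986-03-13), as microseconds since the epoch.
stored = int((datetime(1986, 3, 13) - EPOCH).total_seconds()) * 1_000_000

correct = decode_timestamp(stored, "us")  # unit declared by the Parquet logical type
misread = decode_timestamp(stored, "ns")  # assumed misinterpretation: same integer, wrong unit

print(correct.isoformat())  # 1986-03-13T00:00:00
print(misread.isoformat())  # 1970-01-06T21:57:36
```

Under that assumption, every date in the column would collapse into the first weeks of January 1970, since each value is off by a factor of 1000.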
**To Reproduce**
Here is a sample file to reproduce the issue:
https://github.com/roapi/roapi/files/6599704/msft.parquet.zip.
The file can be reproduced with the following python code:
```python
import yfinance as yf
hist = yf.Ticker('MSFT').history(period="max")
hist.to_parquet('msft.parquet')
```
**Expected behavior**
The `Date` column should be loaded with microsecond precision.
**Additional context**
The arrow parquet crate handles parquet 2.0 files without any issue.
Initially reported in https://github.com/roapi/roapi/issues/42.
Here is the decoded ipc field from the `'ARROW:schema'` metadata for the
Date column in arrow crate:
```
Field {
    name: Some(
        "Date",
    ),
    nullable: true,
    type_type: Timestamp,
    type_: Timestamp {
        unit: NANOSECOND,
        timezone: None,
    },
    dictionary: None,
    children: Some(
        [],
    ),
    custom_metadata: None,
}
```
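
The mismatch comes down to two sources of type information for the same column: the Parquet logical type (microseconds) and the embedded `'ARROW:schema'` hint (nanoseconds). One possible way to reconcile them is to trust the hint only when its unit matches the unit the values are actually stored in. The sketch below uses Python with hypothetical names (`TimestampType`, `resolve_timestamp_type`). It illustrates the idea and is not the arrow crate's actual logic.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TimestampType:
    """Hypothetical stand-in for a timestamp data type."""
    unit: str                       # e.g. "ms", "us", "ns"
    timezone: Optional[str] = None

def resolve_timestamp_type(
    parquet_type: TimestampType,
    arrow_hint: Optional[TimestampType],
) -> TimestampType:
    """Pick the type to expose for a timestamp column.

    parquet_type: what the Parquet logical type says about the stored values.
    arrow_hint:   what the embedded 'ARROW:schema' metadata says, if present.
    """
    if arrow_hint is None:
        return parquet_type
    if arrow_hint.unit != parquet_type.unit:
        # The stored integers are in the Parquet unit, so a hint with a
        # different unit cannot be applied without rescaling the values.
        return parquet_type
    return arrow_hint

# The situation in this report: microseconds on disk, nanoseconds in the hint.
resolved = resolve_timestamp_type(TimestampType("us"), TimestampType("ns"))
print(resolved.unit)  # us
```

An alternative design would keep the hinted nanosecond type and multiply each stored value by 1000 on read. Either way, the unit exposed to the user has to agree with the scale of the values.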