[GitHub] [arrow-rs] houqp opened a new issue #455: Not able to read nano-second timestamp columns in 1.0 parquet files written by pyarrow

GitBox Mon, 14 Jun 2021 00:59:58 -0700


houqp opened a new issue #455:
URL: https://github.com/apache/arrow-rs/issues/455



   **Describe the bug**
   
   Here is a pandas dataframe with a nanosecond timestamp `Date` index:
   
   ```
   >>> hist.index
   DatetimeIndex(['1986-03-13', '1986-03-14', '1986-03-17', '1986-03-18',
                  '1986-03-19', '1986-03-20', '1986-03-21', '1986-03-24',
                  '1986-03-25', '1986-03-26',
                  ...
                  '2021-05-28', '2021-06-01', '2021-06-02', '2021-06-03',
                  '2021-06-04', '2021-06-07', '2021-06-08', '2021-06-09',
                  '2021-06-10', '2021-06-11'],
                 dtype='datetime64[ns]', name='Date', length=8885, freq=None)
   ```
   
   When this dataframe is written in the parquet 1.0 format, pyarrow stores the
   Date column with microsecond units. pyarrow also loads the Date column back
   with microsecond precision:
   
   ```
   >>> from pyarrow.parquet import ParquetFile
   >>> pp = ParquetFile("test_data/msft.parquet")
   >>> pp.metadata.schema
   <pyarrow._parquet.ParquetSchema object at 0x7f720d1bbac0>
   required group field_id=0 schema {
     optional double field_id=1 Open;
     optional double field_id=2 High;
     optional double field_id=3 Low;
     optional double field_id=4 Close;
     optional int64 field_id=5 Volume;
     optional double field_id=6 Dividends;
     optional double field_id=7 StockSplits;
     optional int64 field_id=8 Date (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
   }
   ```
   
   But when the same file is loaded with the arrow parquet crate, the Date
   column is incorrectly loaded as a nanosecond timestamp type.
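
To illustrate why the unit matters, here is a minimal sketch using only the Python standard library. It assumes a reader reinterprets the stored int64 values under the wrong unit without rescaling; that is an assumption for illustration, since this report only establishes that the column type comes back wrong. The helper `to_datetime` is hypothetical and not part of pyarrow or the arrow crate.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_datetime(raw: int, unit: str) -> datetime:
    """Interpret a raw int64 timestamp in the given unit ("us" or "ns")."""
    nanos_per_micro = {"us": 1, "ns": 1_000}[unit]
    return EPOCH + timedelta(microseconds=raw // nanos_per_micro)


# First index entry (1986-03-13) encoded as microseconds since the epoch,
# which is how the parquet 1.0 file stores it.
first_date = datetime(1986, 3, 13, tzinfo=timezone.utc)
raw = (first_date - EPOCH) // timedelta(microseconds=1)

print(to_datetime(raw, "us"))  # read with the unit the file was written in
print(to_datetime(raw, "ns"))  # same integer read as nanoseconds
```

Under that assumption, the same integer read as nanoseconds is off by a factor of 1000 and falls in early January 1970 instead of March 1986.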
   
   **To Reproduce**
   
   Here is a sample file to reproduce the issue: 
https://github.com/roapi/roapi/files/6599704/msft.parquet.zip.
   
   The file can be reproduced with the following python code:
   
   ```python
   import yfinance as yf
   hist = yf.Ticker('MSFT').history(period="max")
   hist.to_parquet('msft.parquet')
   ```
   
   **Expected behavior**
   
   The `Date` column should be loaded with microsecond precision.
   
   **Additional context**
   
   The arrow parquet crate handles parquet 2.0 files without any issue.
   
   Initially reported in https://github.com/roapi/roapi/issues/42.
   
   Here is the decoded IPC field for the Date column, taken from the
   `'ARROW:schema'` metadata as read by the arrow crate:
   
   ```
   Field {
       name: Some(
           "Date",
       ),
       nullable: true,
       type_type: Timestamp,
       type_: Timestamp {
           unit: NANOSECOND,
           timezone: None,
       },
       dictionary: None,
       children: Some(
           [],
       ),
       custom_metadata: None,
   }
   ```


