Liyixin95 opened a new issue, #5625:
URL: https://github.com/apache/arrow-rs/issues/5625
**Describe the bug**

As the title says, `ParquetRecordBatchReader` does not recognize the duration
type written by pandas or polars: the column is read back as plain `Int64`.
**To Reproduce**

First, prepare a Parquet file with polars:
```python
import polars as pl
from datetime import timedelta

# Write a single duration[μs] column holding 100 one-day values.
df = pl.DataFrame({
    "a": [timedelta(days=1) for _ in range(100)]
})
df.write_parquet("./test.parquet")
```
Then, read it back with arrow-rs:
```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::errors::Result;

fn main() -> Result<()> {
    // Open the Parquet file written above.
    let path = "./test.parquet";
    let file = File::open(path).unwrap();

    let parquet_reader = ParquetRecordBatchReaderBuilder::try_new(file)?
        .with_batch_size(8192)
        .build()?;

    let mut batches = Vec::new();
    for batch in parquet_reader {
        batches.push(batch?);
    }

    println!("{:#?}", batches[0].schema());
    Ok(())
}
```
Finally, the schema we get is:
```
Schema {
fields: [
Field {
name: "a",
data_type: Int64,
nullable: true,
dict_id: 0,
dict_is_ordered: false,
metadata: {},
},
],
metadata: {},
}
```
**Expected behavior**

The column should come back as a duration type, matching what the Python readers report.

polars result:
```
shape: (100, 1)
┌──────────────┐
│ a │
│ --- │
│ duration[μs] │
╞══════════════╡
│ 1d │
│ 1d │
│ 1d │
│ 1d │
│ 1d │
│ … │
│ 1d │
│ 1d │
│ 1d │
│ 1d │
│ 1d │
└──────────────┘
```
pandas result:
```
a
0 1 days
1 1 days
2 1 days
3 1 days
4 1 days
.. ...
95 1 days
96 1 days
97 1 days
98 1 days
99 1 days
[100 rows x 1 columns]
```