andrei-ionescu opened a new issue #982:
URL: https://github.com/apache/arrow-rs/issues/982


   **Describe the bug**
   Reading a Parquet file whose timestamp column contains a far-future date 
such as `9999-12-31 02:00:00` results in an overflow panic with the following output:
   ```
   thread 'tokio-runtime-worker' panicked at 'attempt to multiply with overflow'
   ```
   
   **To Reproduce**
   Steps to reproduce the behavior:
   
   1. Download the attached zip file that contains the parquet file: 
[data-dimension-vehicle-20210609T222533Z-4cols-14rows.parquet.zip](https://github.com/apache/arrow-datafusion/files/7601988/data-dimension-vehicle-20210609T222533Z-4cols-14rows.parquet.zip)
   2. Unzip it; this should give you the 
`data-dimension-vehicle-20210609T222533Z-4cols-14rows.parquet` file.
   3. Create a new project with `cargo new read-parquet`, create a `data` 
folder inside the project, and put the parquet file in that folder.
   4. Modify the `Cargo.toml` file to contain the following:
   ```toml
   [package]
   name = "read-parquet"
   version = "0.1.0"
   edition = "2021"
   
   [dependencies]
   tokio = "1.14"
   arrow = "6.0"
   datafusion = "6.0"
   ```
   5. Put the following code in `main.rs` to read the given parquet file:
   ```rust
   use datafusion::prelude::*;
   
   #[tokio::main]
   async fn main() -> datafusion::error::Result<()> {
       let mut ctx = ExecutionContext::new();
       /*
        * Parquet file schema:
        *
        * message spark_schema {
        *   optional binary licence_code (UTF8);
        *   optional binary vehicle_make (UTF8);
        *   optional binary fuel_type (UTF8);
        *   optional int96 dimension_load_date;
        * }
        */
       ctx
           .register_parquet("vehicles", "./data/data-dimension-vehicle-20210609T222533Z-4cols-14rows.parquet")
           .await?;
       let df = ctx
           .sql("
               SELECT
                   licence_code,
                   vehicle_make,
                   fuel_type,
                   CAST(dimension_load_date as TIMESTAMP) as dms
               FROM vehicles
               LIMIT 10
           ")
           .await?;
   
       df
           .show()
           .await?;
   
       Ok(())
   }
   ```
   6. Execute `cargo run`.
   7. Result:
   ```
   thread 'tokio-runtime-worker' panicked at 'attempt to multiply with 
overflow', 
/Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/converter.rs:179:46
   stack backtrace:
      0: rust_begin_unwind
                at 
/rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/panicking.rs:498:5
      1: core::panicking::panic_fmt
                at 
/rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/panicking.rs:107:14
      2: core::panicking::panic
                at 
/rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/panicking.rs:48:5
      3: <parquet::arrow::converter::Int96ArrayConverter as 
parquet::arrow::converter::Converter<alloc::vec::Vec<core::option::Option<parquet::data_type::Int96>>,arrow::array::array_primitive::PrimitiveArray<arrow::datatypes::types::TimestampNanosecondType>>>::convert::{{closure}}::{{closure}}
                at 
/Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/converter.rs:179:46
      4: core::option::Option<T>::map
                at 
/rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/option.rs:846:29
      5: <parquet::arrow::converter::Int96ArrayConverter as 
parquet::arrow::converter::Converter<alloc::vec::Vec<core::option::Option<parquet::data_type::Int96>>,arrow::array::array_primitive::PrimitiveArray<arrow::datatypes::types::TimestampNanosecondType>>>::convert::{{closure}}
                at 
/Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/converter.rs:179:30
      6: core::iter::adapters::map::map_fold::{{closure}}
                at 
/rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/iter/adapters/map.rs:84:28
      7: core::iter::traits::iterator::Iterator::fold
                at 
/rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/iter/traits/iterator.rs:2171:21
      8: <core::iter::adapters::map::Map<I,F> as 
core::iter::traits::iterator::Iterator>::fold
                at 
/rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/iter/adapters/map.rs:124:9
      9: core::iter::traits::iterator::Iterator::for_each
                at 
/rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/iter/traits/iterator.rs:737:9
     10: <alloc::vec::Vec<T,A> as 
alloc::vec::spec_extend::SpecExtend<T,I>>::spec_extend
                at 
/rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/alloc/src/vec/spec_extend.rs:40:17
     11: <alloc::vec::Vec<T> as 
alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter
                at 
/rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/alloc/src/vec/spec_from_iter_nested.rs:56:9
     12: alloc::vec::source_iter_marker::<impl 
alloc::vec::spec_from_iter::SpecFromIter<T,I> for alloc::vec::Vec<T>>::from_iter
                at 
/rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/alloc/src/vec/source_iter_marker.rs:31:20
     13: <alloc::vec::Vec<T> as 
core::iter::traits::collect::FromIterator<T>>::from_iter
                at 
/rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/alloc/src/vec/mod.rs:2549:9
     14: core::iter::traits::iterator::Iterator::collect
                at 
/rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/iter/traits/iterator.rs:1745:9
     15: <parquet::arrow::converter::Int96ArrayConverter as 
parquet::arrow::converter::Converter<alloc::vec::Vec<core::option::Option<parquet::data_type::Int96>>,arrow::array::array_primitive::PrimitiveArray<arrow::datatypes::types::TimestampNanosecondType>>>::convert
                at 
/Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/converter.rs:177:13
     16: <parquet::arrow::converter::ArrayRefConverter<S,A,C> as 
parquet::arrow::converter::Converter<S,alloc::sync::Arc<dyn 
arrow::array::array::Array>>>::convert
                at 
/Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/converter.rs:450:9
     17: <parquet::arrow::array_reader::ComplexObjectArrayReader<T,C> as 
parquet::arrow::array_reader::ArrayReader>::next_batch
                at 
/Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/array_reader.rs:545:25
     18: <parquet::arrow::array_reader::StructArrayReader as 
parquet::arrow::array_reader::ArrayReader>::next_batch::{{closure}}
                at 
/Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/array_reader.rs:1130:27
     19: core::iter::adapters::map::map_try_fold::{{closure}}
                at 
/rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/iter/adapters/map.rs:91:28
     20: core::iter::traits::iterator::Iterator::try_fold
                at 
/rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/iter/traits/iterator.rs:1995:21
     21: <core::iter::adapters::map::Map<I,F> as 
core::iter::traits::iterator::Iterator>::try_fold
                at 
/rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/iter/adapters/map.rs:117:9
     22: <parquet::arrow::array_reader::StructArrayReader as 
parquet::arrow::array_reader::ArrayReader>::next_batch
                at 
/Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/array_reader.rs:1127:30
     23: <parquet::arrow::arrow_reader::ParquetRecordBatchReader as 
core::iter::traits::iterator::Iterator>::next
                at 
/Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/arrow_reader.rs:175:15
     24: datafusion::physical_plan::file_format::parquet::read_partition
                at 
/Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/physical_plan/file_format/parquet.rs:424:19
     25: <datafusion::physical_plan::file_format::parquet::ParquetExec as 
datafusion::physical_plan::ExecutionPlan>::execute::{{closure}}::{{closure}}
                at 
/Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/physical_plan/file_format/parquet.rs:213:29
     26: <tokio::runtime::blocking::task::BlockingTask<T> as 
core::future::future::Future>::poll
                at 
/Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/blocking/task.rs:42:21
     27: tokio::runtime::task::core::CoreStage<T>::poll::{{closure}}
                at 
/Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/core.rs:161:17
     28: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
                at 
/Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/loom/std/unsafe_cell.rs:14:9
     29: tokio::runtime::task::core::CoreStage<T>::poll
                at 
/Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/core.rs:151:13
     30: tokio::runtime::task::harness::poll_future::{{closure}}
                at 
/Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/harness.rs:461:19
     31: <core::panic::unwind_safe::AssertUnwindSafe<F> as 
core::ops::function::FnOnce<()>>::call_once
                at 
/rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/panic/unwind_safe.rs:271:9
     32: std::panicking::try::do_call
                at 
/rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/panicking.rs:406:40
     33: <unknown>
                at 
/Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/physical_plan/distinct_expressions.rs:127:15
     34: std::panicking::try
                at 
/rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/panicking.rs:370:19
     35: std::panic::catch_unwind
                at 
/rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/panic.rs:133:14
     36: tokio::runtime::task::harness::poll_future
                at 
/Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/harness.rs:449:18
     37: tokio::runtime::task::harness::Harness<T,S>::poll_inner
                at 
/Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/harness.rs:98:27
     38: tokio::runtime::task::harness::Harness<T,S>::poll
                at 
/Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/harness.rs:53:15
     39: tokio::runtime::task::raw::poll
                at 
/Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/raw.rs:113:5
     40: tokio::runtime::task::raw::RawTask::poll
                at 
/Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/raw.rs:70:18
     41: tokio::runtime::task::UnownedTask<S>::run
                at 
/Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/mod.rs:379:9
     42: tokio::runtime::blocking::pool::Inner::run
                at 
/Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/blocking/pool.rs:264:17
     43: tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}
                at 
/Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/blocking/pool.rs:244:17
   ```
   
   **Expected behavior**
   The parquet file should be read without panicking. **The same file can be 
read with the `parquet-tools` CLI and Apache Spark.**
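
One possible shape for non-panicking behavior, sketched here only as an illustration and not as the fix adopted by arrow-rs: scale the value with checked arithmetic and return an error (or use a coarser unit) when it does not fit in an `i64`. The `TimeUnit` enum and `scale_seconds` function below are hypothetical names, not part of the `parquet` or `arrow` crates, and the far-future value assumes the timestamp is in UTC.

```rust
/// Hypothetical time units for this sketch (not the arrow crate's `TimeUnit`).
#[derive(Debug, Clone, Copy)]
enum TimeUnit {
    Microsecond,
    Nanosecond,
}

impl TimeUnit {
    /// Number of units in one second.
    fn per_second(self) -> i64 {
        match self {
            TimeUnit::Microsecond => 1_000_000,
            TimeUnit::Nanosecond => 1_000_000_000,
        }
    }
}

/// Scale seconds since the Unix epoch to `unit`, reporting overflow as an
/// error instead of panicking.
fn scale_seconds(seconds: i64, unit: TimeUnit) -> Result<i64, String> {
    seconds
        .checked_mul(unit.per_second())
        .ok_or_else(|| format!("{} s does not fit in an i64 as {:?}", seconds, unit))
}

fn main() {
    // 9999-12-31 02:00:00, assumed to be UTC, as seconds since the Unix epoch.
    let far_future = 253_402_221_600_i64;

    // Nanoseconds overflow, so the caller gets an error rather than a panic.
    println!("{:?}", scale_seconds(far_future, TimeUnit::Nanosecond));
    // Microseconds still fit comfortably.
    println!("{:?}", scale_seconds(far_future, TimeUnit::Microsecond));
}
```

Whether an error, a null value, or a coarser unit is the right outcome is a design decision for the reader implementation; the sketch only shows that the panic itself is avoidable.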
   
   **Additional context**
   The root cause is that the parquet file contains rows with far-future 
dates such as `9999-12-31 02:00:00` in the `dimension_load_date` column. 
**Such dates are supported by Parquet and Spark**.
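
To make the overflow concrete: the backtrace shows `Int96ArrayConverter` producing a `TimestampNanosecondType` array, that is, nanoseconds since the Unix epoch stored in an `i64`, which can only reach roughly the year 2262. The sketch below is not the parquet crate's code. `int96_to_nanos` and its constants are hypothetical names, and it assumes the usual INT96 timestamp layout (a Julian day number plus nanoseconds within the day) and that the displayed values are UTC.

```rust
/// Julian day number of the Unix epoch (1970-01-01).
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
const SECONDS_PER_DAY: i64 = 86_400;
const NANOS_PER_SECOND: i64 = 1_000_000_000;

/// Hypothetical helper: nanoseconds since the Unix epoch for an INT96
/// timestamp, or `None` if the result does not fit in an `i64`.
fn int96_to_nanos(julian_day: i64, nanos_of_day: i64) -> Option<i64> {
    let days_since_epoch = julian_day - JULIAN_DAY_OF_EPOCH;
    days_since_epoch
        .checked_mul(SECONDS_PER_DAY)?
        .checked_mul(NANOS_PER_SECOND)?
        .checked_add(nanos_of_day)
}

fn main() {
    // 2021-06-09 03:02:37 is Julian day 2_459_375 plus 10_957 seconds.
    let recent = int96_to_nanos(2_459_375, 10_957 * NANOS_PER_SECOND);
    // 9999-12-31 02:00:00 is Julian day 5_373_484 plus 7_200 seconds.
    let far_future = int96_to_nanos(5_373_484, 7_200 * NANOS_PER_SECOND);

    println!("2021-06-09 03:02:37 -> {:?}", recent);
    println!("9999-12-31 02:00:00 -> {:?}", far_future);
}
```

With plain `*` instead of `checked_mul`, the second conversion is the kind of multiplication that panics in a debug build with `attempt to multiply with overflow`.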
   
   The content of the parquet file is:
   ```
   +------------+------------------+------------------+-------------------+
   |licence_code|vehicle_make      |fuel_type         |dimension_load_date|
   +------------+------------------+------------------+-------------------+
   |odc-odbl    |**Not Provided**  |**Not Provided**  |9999-12-31 02:00:00|
   |odc-odbl    |**Not Applicable**|**Not Applicable**|9998-12-31 02:00:00|
   |odc-odbl    |SAVIEM            |Petrol            |2021-06-09 03:02:37|
   |odc-odbl    |YAMAHA            |Petrol            |2021-06-09 03:43:47|
   |odc-odbl    |VAUXHALL          |Petrol            |2020-10-18 03:23:47|
   |odc-odbl    |VAUXHALL          |Petrol            |2021-06-09 03:02:37|
   |odc-odbl    |BMW               |Petrol            |2021-06-09 03:38:39|
   |odc-odbl    |MG                |Petrol            |2020-10-18 03:23:47|
   |odc-odbl    |PEUGEOT           |Diesel            |2020-10-18 03:35:16|
   |odc-odbl    |FORD              |Diesel            |2020-10-18 03:23:47|
   |odc-odbl    |FORD              |Petrol            |2020-10-18 03:12:55|
   |odc-odbl    |SKODA             |Diesel            |2021-06-09 03:02:37|
   |odc-odbl    |SHOGUN            |Diesel            |2020-10-18 03:12:55|
   |odc-odbl    |MITSUBISHI        |Diesel            |2021-06-10 01:15:47|
   +------------+------------------+------------------+-------------------+
   ```
   
   For details on how the root cause was identified, see 
apache/arrow-datafusion#1359.
   

