jorgecarleitao edited a comment on issue #1360: URL: https://github.com/apache/arrow-datafusion/issues/1360#issuecomment-979478467
Point taken wrt the int96 deprecation.

The datetime "9999-12-31" is `253402214400` seconds in unix timestamp:

```bash
$ python -c "import datetime; print(datetime.datetime(year=9999,month=12,day=31).timestamp())"
253402214400.0
```

In nanoseconds, this corresponds to `253402214400 * 10^9 = 253_402_214_400_000_000_000`. The maximum `i64` in Rust [equals](https://doc.rust-lang.org/std/i64/constant.MAX.html) `9_223_372_036_854_775_807`. Comparing the two, we have:

```
253_402_214_400_000_000_000 > 9_223_372_036_854_775_807
```

This was the rationale I used to conclude that we can't fit "9999-12-31" in an `i64` as nanoseconds since the epoch. Since Java's `Long` is also an `i64` with the same maximum as Rust's, I concluded that Spark must be discarding _something_ to fit such a date in a `Long`, since there is simply not enough precision to represent that date in `i64` nanoseconds. So, I looked for what they did.

`int96` represents `[i64 nanos, i32 days]`. When reading such bytes from parquet, the interface that Spark uses must be something that consumes such types, and `fromJulianDay(days: Int, nanos: Long)` is the only one that does. As I mentioned, that code truncates the nanoseconds, which is consistent with being able to read that date _in microseconds_: the two numbers above differ by less than a factor of 1000, so dropping from nanoseconds to microseconds brings the value back within `i64` range.

I may be wrong. The parquet code in Rust is [here](https://github.com/apache/arrow-rs/blob/master/parquet/src/data_type.rs#L65). Note that it only goes to millis. The conversion to ns is done [here](https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/converter.rs#L179).
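The overflow argument above can be checked directly with Rust's checked arithmetic (a standalone sketch, not code from either repository):

```rust
fn main() {
    // 9999-12-31T00:00:00 UTC as seconds since the Unix epoch.
    let seconds: i64 = 253_402_214_400;

    // Converting to nanoseconds overflows i64: checked_mul returns None.
    assert_eq!(seconds.checked_mul(1_000_000_000), None);

    // Converting to microseconds fits comfortably.
    assert_eq!(seconds.checked_mul(1_000_000), Some(253_402_214_400_000_000));
    assert!(253_402_214_400_000_000_i64 < i64::MAX);
}
```

This confirms the conclusion: the date is representable in `i64` microseconds but not in `i64` nanoseconds.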
