bezbac opened a new issue, #2580:
URL: https://github.com/apache/arrow-rs/issues/2580

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   There's an open issue in the datafusion repository about CSV schema 
inference. The current implementation in arrow returns `Int64` as the data 
type for any numeric column that has no decimal point and doesn't match a date 
format. This causes problems when the CSV is read later and a value overflows 
the `Int64` data type.
   
   Here's the datafusion issue 
https://github.com/apache/arrow-datafusion/issues/3174#issuecomment-1221579911
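   To illustrate the failure mode, here is a minimal sketch (the sample value is hypothetical, not taken from the linked issue): a value just above `i64::MAX` cannot be parsed into the inferred `Int64` column, even though it would fit a wider type.

   ```rust
   // Sketch of the failure mode: a column inferred as Int64 later fails to
   // parse values that exceed i64::MAX, although UInt64 would hold them.
   fn main() {
       let value = "9223372036854775808"; // i64::MAX + 1
       assert!(value.parse::<i64>().is_err()); // overflows the inferred Int64
       assert!(value.parse::<u64>().is_ok()); // but fits in UInt64
   }
   ```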
   
   **Describe the solution you'd like**
   Maybe arrow could additionally infer the `UInt64` and `Decimal128` data 
types when it notices that values in the CSV are too large for `Int64`, or 
even fall back to `String` when those types are also too small, so that the 
CSV can be read without problems.
   
   **Describe alternatives you've considered**
   Alternatively, the column's type could be "upgraded" while reading the 
CSV whenever a parsing error occurs due to overflow. This would require 
casting all previously parsed values, which better inference results could 
hopefully avoid.
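   A rough sketch of that alternative, to show the casting cost it mentions: parse optimistically as `i64`, and on the first overflow widen everything parsed so far to `u64` before continuing. All names are illustrative, and the sketch assumes the remaining values are non-negative and fit `u64`.

   ```rust
   // Hypothetical "upgrade on overflow" parsing, not an arrow-rs API.
   fn parse_with_upgrade(values: &[&str]) -> Vec<u64> {
       let mut parsed_i64: Vec<i64> = Vec::new();
       for (i, v) in values.iter().enumerate() {
           match v.parse::<i64>() {
               Ok(n) => parsed_i64.push(n),
               Err(_) => {
                   // Overflow hit: re-cast everything parsed so far...
                   let mut widened: Vec<u64> =
                       parsed_i64.iter().map(|&n| n as u64).collect();
                   // ...then parse the rest directly as u64.
                   widened.push(v.parse::<u64>().expect("fits u64"));
                   for rest in &values[i + 1..] {
                       widened.push(rest.parse::<u64>().expect("fits u64"));
                   }
                   return widened;
               }
           }
       }
       parsed_i64.iter().map(|&n| n as u64).collect()
   }

   fn main() {
       let parsed = parse_with_upgrade(&["1", "18446744073709551615"]);
       assert_eq!(parsed, vec![1u64, 18446744073709551615u64]);
   }
   ```

   The re-cast of `parsed_i64` is exactly the overhead the paragraph above hopes to avoid with better up-front inference.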
   
   **Additional context**
   I'd be open to implementing this change. My naive approach would be 
something like this: 
https://github.com/apache/arrow-rs/commit/4b3104ea431835018c4fb90003013e7d2c7fe47b#
 If anyone here has suggestions on how to improve it, I'd be very happy to 
hear them.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
