CookiePieWw commented on issue #3174:
URL: https://github.com/apache/datafusion/issues/3174#issuecomment-2379426643

   Hi, I took a look based on @bezbac's findings and found that `arrow-csv` 
uses regular expressions to match strings and then infer their types. The regex 
for `Int64` accepts any string of digits rather than only numbers within the 
`Int64` range.
   (It seems GitHub cannot embed code through links that do not belong to this 
repo, so I copied it here.)
   
   ```rust
   /// See https://github.com/apache/arrow-rs/blob/ebcc4a585136cd1d9696c38c41f71c9ced181f57/arrow-csv/src/reader/mod.rs#L146-L158
       /// Order should match [`InferredDataType`]
       static ref REGEX_SET: RegexSet = RegexSet::new([
           // ...
           r"^-?(\d+)$", //INTEGER
           // ...
       ]).unwrap();
   
   /// See https://github.com/apache/arrow-rs/blob/ebcc4a585136cd1d9696c38c41f71c9ced181f57/arrow-csv/src/reader/mod.rs#L214-L223
       /// Updates the [`InferredDataType`] with the given string
       fn update(&mut self, string: &str) {
           self.packed |= if string.starts_with('"') {
               1 << 8 // Utf8
           } else if let Some(m) = REGEX_SET.matches(string).into_iter().next() {
               1 << m
           } else {
               1 << 8 // Utf8
           }
       }
   ```
   
   One solution is to change the regex for `Int64` so that it only matches 
values within the `Int64` range, but that regex seems to be very complicated. 
I asked GPT for one and the result was more than 300 characters long, so I 
wonder if there are alternatives :)
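
One possible alternative, sketched here with the standard library only: keep the simple pattern and add a range check with `str::parse::<i64>()` for strings that match it. The helper names `looks_like_integer` and `fits_in_i64` are hypothetical and are not part of `arrow-csv`. The first helper is an ASCII-only approximation of `^-?(\d+)$`. This is not how `arrow-csv` currently behaves, and it says nothing about which type an out-of-range value should fall back to.

```rust
/// Hypothetical helper: ASCII-only approximation of the `^-?(\d+)$` pattern.
/// Like the regex, it accepts digit strings of any length.
fn looks_like_integer(s: &str) -> bool {
    let digits = s.strip_prefix('-').unwrap_or(s);
    !digits.is_empty() && digits.bytes().all(|b| b.is_ascii_digit())
}

/// Hypothetical helper: true only if the string matches the pattern
/// and its value also fits in an `i64`.
fn fits_in_i64(s: &str) -> bool {
    looks_like_integer(s) && s.parse::<i64>().is_ok()
}

fn main() {
    for s in ["42", "-9223372036854775808", "9223372036854775808", "abc"] {
        println!(
            "{s}: pattern={} i64={}",
            looks_like_integer(s),
            fits_in_i64(s)
        );
    }
}
```

With a check like this, `9223372036854775808` (one past `i64::MAX`) still matches the pattern but fails the range check, so the inference step could mark it as some other type instead of `Int64`.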
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

