CookiePieWw commented on issue #3174: URL: https://github.com/apache/datafusion/issues/3174#issuecomment-2379426643
Hi, I took a look based on @bezbac's findings and found that `arrow-csv` uses regular expressions to match strings and then infer their types. The regex for `Int64` accepts any run of digits, not just numbers within the range of `Int64`, so a value that overflows `Int64` is still inferred as an integer.

(It seems GitHub cannot embed code through links that do not belong to this repo, so I copied it here.)

```rust
/// See https://github.com/apache/arrow-rs/blob/ebcc4a585136cd1d9696c38c41f71c9ced181f57/arrow-csv/src/reader/mod.rs
/// #L146-L158
/// Order should match [`InferredDataType`]
static ref REGEX_SET: RegexSet = RegexSet::new([
    // ...
    r"^-?(\d+)$", //INTEGER
    // ...
]).unwrap();

/// See https://github.com/apache/arrow-rs/blob/ebcc4a585136cd1d9696c38c41f71c9ced181f57/arrow-csv/src/reader/mod.rs
/// #L214-L223
/// Updates the [`InferredDataType`] with the given string
fn update(&mut self, string: &str) {
    self.packed |= if string.starts_with('"') {
        1 << 8 // Utf8
    } else if let Some(m) = REGEX_SET.matches(string).into_iter().next() {
        1 << m
    } else {
        1 << 8 // Utf8
    }
}
```

One solution is to change the regex for `Int64` so that it only matches values within range, but that seems to be a very complicated expression. I asked GPT to generate one and it came out at more than 300 characters, so I wonder if there are alternatives :)
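One possible alternative, as a rough sketch only: keep the permissive integer pattern and add a range check with the standard library's `str::parse::<i64>()`, which returns an error on overflow. The names `Inferred`, `looks_like_integer`, and `infer_integer` are hypothetical and not part of `arrow-csv`. The shape check is an ASCII-only stand-in for the regex so the example needs no external crate, and falling back to `Utf8` for out-of-range values is an assumption (another type might be a better fit).

```rust
/// Hypothetical result type for this sketch; not an arrow-csv type.
#[derive(Debug, PartialEq)]
enum Inferred {
    Int64,
    Utf8,
}

/// ASCII-only approximation of the `^-?(\d+)$` pattern:
/// an optional leading '-', followed by one or more digits.
fn looks_like_integer(s: &str) -> bool {
    let digits = s.strip_prefix('-').unwrap_or(s);
    !digits.is_empty() && digits.bytes().all(|b| b.is_ascii_digit())
}

/// Infers `Int64` only when the string has integer shape AND fits in an i64.
/// Out-of-range values fall back to `Utf8` (an assumption for this sketch).
fn infer_integer(s: &str) -> Inferred {
    if looks_like_integer(s) && s.parse::<i64>().is_ok() {
        Inferred::Int64
    } else {
        Inferred::Utf8
    }
}
```

With this check, `infer_integer("9223372036854775807")` yields `Inferred::Int64`, while `infer_integer("9223372036854775808")` (one past `i64::MAX`) yields `Inferred::Utf8`. The trade-off is an extra parse per integer-shaped value during inference, in exchange for not having to maintain a range-encoding regex.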