[I] DECIMAL regex in csv reader does not accept positive exponent specifier [arrow-rs]

via GitHub Mon, 15 Apr 2024 09:01:19 -0700


jdcasale opened a new issue, #5648:
URL: https://github.com/apache/arrow-rs/issues/5648


   **Describe the bug**
   Decimals in scientific notation are frequently written with an (admittedly
unnecessary) positive exponent specifier, e.g. "3.106e+04". [The existing regex](https://github.com/apache/arrow-rs/blame/master/arrow-csv/src/reader/mod.rs#L151)
allows negative exponent specifiers but does not recognize a number with a
positive one. As a result, schema inference types any column containing
positive exponent specifiers as Utf8 instead of a float type.
   
   As a sanity check, I tried the same thing in DuckDB, and their csv parser 
does not make this error. 
   
   **To Reproduce**
   Either attempt to infer the schema for a csv file containing the offending
pattern (as I have done [in this provided example](https://github.com/jdcasale/arrow-csv-parse-bug/blob/develop/src/main.rs)),
or run the existing regex directly against the example offender,
"3.106e+04". It will not match.
   
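   To make the gap concrete without depending on the regex crate, here is a minimal hand-written sketch of the decimal shape described above. It is not the arrow-csv code: `looks_like_decimal` and its `allow_plus_exponent` flag are hypothetical names introduced only for illustration, and the grammar (optional leading minus, digits with an optional fraction, optional exponent) is an assumption based on the behavior reported in this issue.

```rust
/// Hypothetical helper, not part of arrow-csv.
/// Accepts: [-] digits [. digits] [(e|E) [sign] digits]
/// and requires a '.' or an exponent so plain integers are not matched.
/// With `allow_plus_exponent = false` only '-' is accepted as exponent sign,
/// which mimics the behavior reported in this issue.
fn looks_like_decimal(s: &str, allow_plus_exponent: bool) -> bool {
    let bytes = s.as_bytes();
    let mut i = 0;

    // Optional leading minus sign on the mantissa.
    if i < bytes.len() && bytes[i] == b'-' {
        i += 1;
    }

    // Integer digits of the mantissa.
    let int_start = i;
    while i < bytes.len() && bytes[i].is_ascii_digit() {
        i += 1;
    }
    let int_digits = i - int_start;

    // Optional fractional part.
    let mut has_dot = false;
    let mut frac_digits = 0;
    if i < bytes.len() && bytes[i] == b'.' {
        has_dot = true;
        i += 1;
        let frac_start = i;
        while i < bytes.len() && bytes[i].is_ascii_digit() {
            i += 1;
        }
        frac_digits = i - frac_start;
    }

    // The mantissa needs at least one digit somewhere.
    if int_digits + frac_digits == 0 {
        return false;
    }

    // Optional exponent: this is where the '+' sign matters.
    let mut has_exp = false;
    if i < bytes.len() && (bytes[i] == b'e' || bytes[i] == b'E') {
        has_exp = true;
        i += 1;
        if i < bytes.len()
            && (bytes[i] == b'-' || (allow_plus_exponent && bytes[i] == b'+'))
        {
            i += 1;
        }
        let exp_start = i;
        while i < bytes.len() && bytes[i].is_ascii_digit() {
            i += 1;
        }
        if i == exp_start {
            return false; // no exponent digits, or an unsupported sign
        }
    }

    // The whole input must be consumed.
    i == bytes.len() && (has_dot || has_exp)
}
```

   With `allow_plus_exponent` set to `false`, "3.106e-04" is accepted and "3.106e+04" is rejected, which is the reported symptom. Setting it to `true` accepts both.
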
   **Expected behavior**
   The decimal regex recognizes "3.106e+04" as a float value, not a Utf8 string.
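
   As a point of comparison, Rust's standard-library `f64` parser already accepts an explicit plus sign in the exponent. The sketch below uses only `std`; `parse_cell_as_f64` is a hypothetical helper for illustration and says nothing about how arrow-csv parses values internally.

```rust
/// Hypothetical helper: try to read a CSV cell as an f64 using only std.
/// Returns None when the cell is not a valid floating-point literal.
fn parse_cell_as_f64(cell: &str) -> Option<f64> {
    cell.trim().parse::<f64>().ok()
}
```

   Under this sketch, "3.106e+04" and "3.106e4" parse to the same value, so a cell with a positive exponent specifier is a perfectly valid float as far as `std` is concerned.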
   
   **Additional context**


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
