andygrove opened a new issue, #3232:
URL: https://github.com/apache/datafusion-comet/issues/3232

   ## Summary
   
   During review of PR #3004 (which adds basic `to_csv` support), fuzz testing 
revealed several edge cases that are not handled correctly. These should be 
addressed in follow-up work after the initial implementation is merged.
   
   ## Bugs Found
   
   ### 1. Null value not quoted when it contains special characters
   When the `nullValue` option contains the delimiter or other special 
characters (e.g., `"N,A"`), it's written unquoted, corrupting the CSV output.
   
   | Expected (Spark) | Actual (Comet) |
   |------------------|----------------|
   | `"N,A",world` | `N,A,world` |
   | `hello,"N,A"` | `hello,N,A` |
   
   **Location:** `native/spark-expr/src/csv_funcs/to_csv.rs:164-171`
   
   **Fix:** Check if `null_value` contains special characters and quote/escape 
it appropriately.
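   A minimal sketch of that check (function name and signature are 
illustrative, not the actual Comet code; Spark's `escape` option, which 
defaults to backslash, would also need to be honored):

   ```rust
   /// Hypothetical helper for bug 1: quote the configured `nullValue`
   /// when it contains the delimiter, the quote character, or a line
   /// break, so that e.g. nullValue = "N,A" round-trips correctly.
   fn format_null_value(null_value: &str, delimiter: char, quote: char) -> String {
       let needs_quoting = null_value.contains(delimiter)
           || null_value.contains(quote)
           || null_value.contains('\n')
           || null_value.contains('\r');
       if needs_quoting {
           // Escape embedded quotes by doubling them (Spark's default
           // `escape` option uses backslash instead; elided here).
           let escaped = null_value.replace(quote, &format!("{quote}{quote}"));
           format!("{quote}{escaped}{quote}")
       } else {
           null_value.to_string()
       }
   }

   fn main() {
       // "N,A" contains the delimiter, so it is quoted.
       assert_eq!(format_null_value("N,A", ',', '"'), "\"N,A\"");
       // A plain marker passes through unchanged.
       assert_eq!(format_null_value("NULL", ',', '"'), "NULL");
   }
   ```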
   
   ### 2. Whitespace trimming applied incorrectly
   When `ignoreLeadingWhiteSpace=false` or `ignoreTrailingWhiteSpace=false`, 
strings that combine whitespace with special characters are mishandled: the 
code trims whitespace before deciding whether quoting is needed, so content 
can be lost.
   
   | Expected (Spark) | Actual (Comet) |
   |------------------|----------------|
   | `  \"` (preserved whitespace with escaped quote) | `""` (empty) |
   
   **Location:** `native/spark-expr/src/csv_funcs/to_csv.rs:176-183`
   
   **Fix:** Reorder the operations: the quoting decision should be based on 
the original (untrimmed) value, with trimming applied afterwards and only 
when the corresponding option is enabled.
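   One way to sketch that ordering (names are hypothetical; the trim flags 
correspond to `ignoreLeadingWhiteSpace`/`ignoreTrailingWhiteSpace`):

   ```rust
   /// Hypothetical field formatter for bug 2: the quoting decision is
   /// made on the ORIGINAL value; trimming happens afterwards, and only
   /// when the corresponding option is enabled.
   fn format_field(
       value: &str,
       delimiter: char,
       quote: char,
       ignore_leading_ws: bool,
       ignore_trailing_ws: bool,
   ) -> String {
       // Inspect the untrimmed value so whitespace surrounding special
       // characters is not silently discarded.
       let needs_quoting = value.contains(delimiter)
           || value.contains(quote)
           || value.contains('\n')
           || value.contains('\r');

       let mut v = value;
       if ignore_leading_ws {
           v = v.trim_start();
       }
       if ignore_trailing_ws {
           v = v.trim_end();
       }

       if needs_quoting {
           // Quote-doubling for brevity; Spark's default `escape` is backslash.
           let escaped = v.replace(quote, &format!("{quote}{quote}"));
           format!("{quote}{escaped}{quote}")
       } else {
           v.to_string()
       }
   }

   fn main() {
       // With both trim options disabled, `  "` keeps its whitespace
       // and is quoted rather than collapsing to an empty field.
       assert_eq!(format_field("  \"", ',', '"', false, false), "\"  \"\"\"");
       // With trimming enabled, leading whitespace is dropped first.
       assert_eq!(format_field("  abc", ',', '"', true, true), "abc");
   }
   ```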
   
   ### 3. Decimal formatting mismatch
   Spark uses scientific notation for small decimal values, while Comet uses 
fixed-point notation.
   
   | Expected (Spark) | Actual (Comet) |
   |------------------|----------------|
   | `0E-18` | `0.000000000000000000` |
   
   **Fix:** Align decimal-to-string casting with Spark's formatting behavior.
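   For reference, Spark renders decimals via Java's `BigDecimal.toString`, 
which switches to scientific notation when the adjusted exponent drops below 
-6; for a zero with scale 18 that yields `0E-18`. A partial sketch of that 
rule (hypothetical function; nonzero small values and the remaining 
scientific-notation cases are elided):

   ```rust
   /// Hypothetical partial port of BigDecimal.toString for bug 3, taking
   /// a decimal as an unscaled integer plus a scale. Only the zero case
   /// and a naive fixed-point path are shown.
   fn decimal_to_spark_string(unscaled: i128, scale: u32) -> String {
       if unscaled == 0 && scale > 6 {
           // Zero has precision 1, so the adjusted exponent is -scale;
           // below -6, BigDecimal.toString uses scientific notation.
           return format!("0E-{scale}");
       }
       // Naive fixed-point rendering for illustration; nonzero values
       // with adjusted exponent < -6 also need scientific notation.
       let neg = unscaled < 0;
       let digits = unscaled.unsigned_abs().to_string();
       let s = scale as usize;
       let body = if s == 0 {
           digits
       } else if digits.len() > s {
           let (int_part, frac_part) = digits.split_at(digits.len() - s);
           format!("{int_part}.{frac_part}")
       } else {
           // Pad with leading zeros, e.g. unscaled 5 / scale 3 -> "0.005".
           format!("0.{digits:0>s$}")
       };
       if neg { format!("-{body}") } else { body }
   }

   fn main() {
       // Matches Spark's output for a zero DECIMAL(38, 18).
       assert_eq!(decimal_to_spark_string(0, 18), "0E-18");
       assert_eq!(decimal_to_spark_string(12345, 2), "123.45");
       assert_eq!(decimal_to_spark_string(-5, 3), "-0.005");
   }
   ```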
   
   ### 4. NPE with single-column struct (needs investigation)
   `NullPointerException` occurs when processing single-column structs with 
certain null patterns. This may be a Spark-side issue with how Comet's output 
is handled, but needs investigation.
   
   ## Reproduction
   
   Fuzz tests were added in `CometCsvExpressionSuite.scala` that reproduce 
these issues:
   - `to_csv - edge case: delimiter in null value representation`
   - `to_csv - fuzz test: comprehensive random data and options`  
   - `to_csv - edge case: numeric boundary values`
   - `to_csv - edge case: single column struct`
   
   ## Related
   
   - PR #3004 - Initial `to_csv` implementation


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

