[PR] [SPARK-46890][SQL] Fix CSV parsing bug with existence default values and column pruning [spark]

via GitHub Mon, 29 Jan 2024 14:15:38 -0800


dtenedor opened a new pull request, #44939:
URL: https://github.com/apache/spark/pull/44939


   ### What changes were proposed in this pull request?
   
   This PR fixes a CSV parsing bug with existence default values and column 
pruning (https://issues.apache.org/jira/browse/SPARK-46890).
   
   The fix disables column pruning specifically when checking the CSV header 
schema against the required schema expected by Catalyst. This makes the 
expected schema match what the CSV parser provides, since later we also 
instruct the CSV parser to disable column pruning and read each entire row, 
so that it can correctly assign the default value(s) during execution.
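
The mismatch can be illustrated with a small standalone model. This is only a sketch and not Spark's actual code: `Field`, `schema_for_header_check`, and `check_header` are hypothetical names, and the rule that pruning is turned off whenever the required schema contains a column with an existence default is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Field:
    name: str
    # Hypothetical stand-in for a column's existence default (None = no default).
    existence_default: Optional[str] = None


def schema_for_header_check(data_schema: List[Field],
                            required_schema: List[Field],
                            column_pruning: bool) -> List[Field]:
    """Pick the schema that the CSV header is compared against."""
    # Assumption: if any required column carries an existence default, the
    # parser reads full rows, so the header check must use the full schema too.
    has_defaults = any(f.existence_default is not None for f in required_schema)
    if column_pruning and not has_defaults:
        return required_schema
    return data_schema


def check_header(header: List[str], schema: List[Field]) -> None:
    """Raise if the header and the schema disagree on the number of columns."""
    if len(header) != len(schema):
        raise ValueError(
            f"Header length: {len(header)}, schema size: {len(schema)}")


# Sample data mirroring the `products` table from this description.
header = ["product_id", "name", "price", "quantity"]
data_schema = [
    Field("product_id"),
    Field("name"),
    Field("price", "0.0"),
    Field("quantity", "0"),
]
required_schema = [Field("price", "0.0")]  # SELECT price FROM products
```

In this model, checking the header against `required_schema` directly reproduces the length mismatch (4 versus 1), while checking it against the result of `schema_for_header_check(data_schema, required_schema, True)` passes because the full four-column schema is used.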
   
   ### Why are the changes needed?
   
   Before this change, queries that selected a subset of the columns in a CSV 
table whose `CREATE TABLE` statement contained default values would fail with 
an internal exception. For example:
   
   ```
   CREATE TABLE IF NOT EXISTS products (
     product_id INT,
     name STRING,
     price FLOAT default 0.0,
     quantity INT default 0
   )
   USING CSV
   OPTIONS (
     header 'true',
     inferSchema 'false',
     enforceSchema 'false',
     path '/Users/maximgekk/tmp/products.csv'
   );
   ```
   
   The CSV file products.csv:
   
   ```
   product_id,name,price,quantity
   1,Apple,0.50,100
   2,Banana,0.25,200
   3,Orange,0.75,50
   ```
   
   The query fails:
   
   ```
   spark-sql (default)> SELECT price FROM products;
   24/01/28 11:43:09 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 6)
   java.lang.IllegalArgumentException: Number of column in CSV header is not 
equal to number of fields in the schema:
    Header length: 4, schema size: 1
   CSV file: file:///Users/maximgekk/tmp/products.csv
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   This PR adds test coverage.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


