[GitHub] [spark] MaxGekk opened a new pull request #35844: [SPARK-38523][SQL][3.2] Fix referring to the corrupt record column from CSV

GitBox Sun, 13 Mar 2022 23:34:51 -0700


MaxGekk opened a new pull request #35844:
URL: https://github.com/apache/spark/pull/35844



   ### What changes were proposed in this pull request?
   When a user specifies the corrupt record column via the CSV option `columnNameOfCorruptRecord`:
   1. Disable the column pruning feature in the CSV parser.
   2. Don't push filters that refer to the "virtual" column `columnNameOfCorruptRecord` down to `UnivocityParser`. Because the column cannot be present in the input CSV, user queries failed while compiling such predicates. After the changes, the skipped filters are applied later, in the upper layer.
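
   The two steps above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the patch: the `Filter` hierarchy is a simplified stand-in for Spark's data source filters, and `CorruptRecordFilters`, `split` and `columnPruningEnabled` are hypothetical names introduced only for this example.

   ```Scala
   // Simplified stand-ins for data source filters (hypothetical, for illustration only).
   sealed trait Filter { def references: Set[String] }
   final case class IsNotNull(attribute: String) extends Filter {
     def references: Set[String] = Set(attribute)
   }
   final case class EqualTo(attribute: String, value: Any) extends Filter {
     def references: Set[String] = Set(attribute)
   }
   final case class And(left: Filter, right: Filter) extends Filter {
     def references: Set[String] = left.references ++ right.references
   }

   object CorruptRecordFilters {
     // Step 1: keep column pruning only when the corrupt record column is not requested.
     def columnPruningEnabled(
         pruningConf: Boolean,
         requiredColumns: Seq[String],
         corruptColumn: String): Boolean =
       pruningConf && !requiredColumns.contains(corruptColumn)

     // Step 2: split filters into (safe to push to the parser, to be applied after parsing).
     // A filter is held back if it references the corrupt record column anywhere.
     def split(filters: Seq[Filter], corruptColumn: String): (Seq[Filter], Seq[Filter]) =
       filters.partition(f => !f.references.contains(corruptColumn))
   }
   ```

   For instance, `CorruptRecordFilters.split(Seq(IsNotNull("corrRec"), EqualTo("a", 0)), "corrRec")` keeps `EqualTo("a", 0)` as pushable and holds back `IsNotNull("corrRec")` for the upper layer.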
   
   ### Why are the changes needed?
   The changes allow user queries to refer to the corrupt record column:
   
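   ```Scala
   import org.apache.spark.sql.types.StructType
   import spark.implicits._ // enables the $"..." column syntax

   // Assumed schema (not shown in the original description); it must include the corrupt record column.
   val schema = StructType.fromDDL("a INT, b DATE, corrRec STRING")

   spark.read.format("csv")
     .option("header", "true")
     .option("columnNameOfCorruptRecord", "corrRec")
     .schema(schema)
     .load("csv_corrupt_record.csv")
     .filter($"corrRec".isNotNull)
     .show()
   ```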
   for the input file "csv_corrupt_record.csv":
   ```
   0,2013-111_11 12:13:14
   1,1983-08-04 
   ```
   the query returns:
   ```
   +---+----+----------------------+
   |a  |b   |corrRec               |
   +---+----+----------------------+
   |0  |null|0,2013-111_11 12:13:14|
   +---+----+----------------------+
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. Before the changes, the query above failed with the exception:
   ```Java
   java.lang.IllegalArgumentException: _corrupt_record does not exist. Available: a, b
        at org.apache.spark.sql.types.StructType.$anonfun$fieldIndex$1(StructType.scala:310) ~[classes/:?]
   ```
   
   ### How was this patch tested?
   By running the new CSV test in the following suites:
   ```
   $ build/sbt "sql/testOnly *.CSVv1Suite"
   $ build/sbt "sql/testOnly *.CSVv2Suite"
   $ build/sbt "sql/testOnly *.CSVLegacyTimeParserSuite"
   ```
   
   Authored-by: Max Gekk <[email protected]>
   Signed-off-by: Wenchen Fan <[email protected]>
   (cherry picked from commit 959694271e30879c944d7fd5de2740571012460a)
   Signed-off-by: Max Gekk <[email protected]>




