[ https://issues.apache.org/jira/browse/SPARK-40468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yuming Wang updated SPARK-40468:
--------------------------------
    Fix Version/s: 3.3.1
                       (was: 3.3.2)

> Column pruning is not handled correctly in CSV when _corrupt_record is used
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-40468
>                 URL: https://issues.apache.org/jira/browse/SPARK-40468
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0, 3.2.2, 3.4.0
>            Reporter: Ivan Sadikov
>            Assignee: Ivan Sadikov
>            Priority: Major
>              Labels: correctness
>             Fix For: 3.4.0, 3.3.1
>
>
> I have found that, depending on the name of the corrupt record column in CSV, the field is populated incorrectly. Here is an example:
> {code:java}
> 1,a
> /tmp/file.csv
> ===
> val df = spark.read
>   .schema("c1 int, c2 string, x string, _corrupt_record string")
>   .csv("file:/tmp/file.csv")
>   .withColumn("x", lit("A"))
>
> Result:
> +---+---+---+---------------+
> |c1 |c2 |x  |_corrupt_record|
> +---+---+---+---------------+
> |1  |a  |A  |1,a            |
> +---+---+---+---------------+{code}
> However, if you rename the {{_corrupt_record}} column to something else, the result is different:
> {code:java}
> val df = spark.read
>   .option("columnNameOfCorruptRecord", "corrupt_record")
>   .schema("c1 int, c2 string, x string, corrupt_record string")
>   .csv("file:/tmp/file.csv")
>   .withColumn("x", lit("A"))
>
> Result:
> +---+---+---+--------------+
> |c1 |c2 |x  |corrupt_record|
> +---+---+---+--------------+
> |1  |a  |A  |null          |
> +---+---+---+--------------+{code}
> This is due to an inconsistency in CSVFileFormat: when enabling columnPruning, we check the SQLConf value for the corrupt record column name, but the CSV reader relies on the {{columnNameOfCorruptRecord}} option instead.
> This also disables column pruning, which used to work in Spark versions prior to
> https://github.com/apache/spark/commit/959694271e30879c944d7fd5de2740571012460a.
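The root cause described above can be illustrated with a small stand-alone model (written here in Python; this is a sketch, not Spark source code, and the function names are hypothetical): the pruning decision consults one name (the session-wide SQLConf default, `spark.sql.columnNameOfCorruptRecord`) while the parser consults another (the per-read `columnNameOfCorruptRecord` option), so the two disagree whenever the option overrides the default.

```python
# Illustrative model of the SPARK-40468 mismatch (hypothetical helpers, not Spark code).
# SQLCONF_DEFAULT stands in for spark.sql.columnNameOfCorruptRecord.

SQLCONF_DEFAULT = "_corrupt_record"  # session-wide default column name

def pruning_disabled_buggy(schema, read_options):
    # Buggy check: consults only the SQLConf name and ignores the
    # per-read columnNameOfCorruptRecord option.
    return SQLCONF_DEFAULT in schema

def pruning_disabled_fixed(schema, read_options):
    # Consistent check: uses the same effective name the CSV parser uses,
    # i.e. the per-read option when present, the SQLConf default otherwise.
    effective = read_options.get("columnNameOfCorruptRecord", SQLCONF_DEFAULT)
    return effective in schema

# Second example from the report: renamed corrupt column via the option.
schema = ["c1", "c2", "x", "corrupt_record"]
opts = {"columnNameOfCorruptRecord": "corrupt_record"}

print(pruning_disabled_buggy(schema, opts))  # False: pruning stays on, column ends up null
print(pruning_disabled_fixed(schema, opts))  # True: pruning is disabled, column is populated
```

Under this model, a session-level workaround would be to keep the SQLConf default and the reader option in agreement (e.g. set `spark.sql.columnNameOfCorruptRecord` to the renamed column), so both code paths see the same name.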
--
This message was sent by Atlassian Jira
(v8.20.10#820010)