[
https://issues.apache.org/jira/browse/SPARK-40468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Sadikov updated SPARK-40468:
---------------------------------
Description:
I have found that, depending on the name of the corrupt record column in the CSV
schema, the field is populated incorrectly. Here is an example:
{code:java}
// /tmp/file.csv contains a single line: 1,a
val df = spark.read
  .schema("c1 int, c2 string, x string, _corrupt_record string")
  .csv("file:/tmp/file.csv")
  .withColumn("x", lit("A"))

df.show(false)

Result:
+---+---+---+---------------+
|c1 |c2 |x  |_corrupt_record|
+---+---+---+---------------+
|1  |a  |A  |1,a            |
+---+---+---+---------------+{code}
However, if you rename the {{_corrupt_record}} column to something else, the
result is different:
{code:java}
val df = spark.read
  .option("columnNameCorruptRecord", "corrupt_record")
  .schema("c1 int, c2 string, x string, corrupt_record string")
  .csv("file:/tmp/file.csv")
  .withColumn("x", lit("A"))

df.show(false)

Result:
+---+---+---+--------------+
|c1 |c2 |x  |corrupt_record|
+---+---+---+--------------+
|1  |a  |A  |null          |
+---+---+---+--------------+{code}
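As the issue title suggests, this looks tied to CSV column pruning: the parser
appears to only special-case the corrupt-record column when its name matches the
session's configured corrupt-record column name. As an unverified workaround
sketch, one could try disabling CSV column pruning via the
{{spark.sql.csv.parser.columnPruning.enabled}} conf (an internal Spark SQL
setting; whether it restores the renamed column here is an assumption, not
something confirmed in this report):
{code:java}
// Hypothetical workaround, not verified against this bug: turn off CSV
// column pruning before reading, so no parsed fields are skipped.
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "false")

val df = spark.read
  .option("columnNameCorruptRecord", "corrupt_record")
  .schema("c1 int, c2 string, x string, corrupt_record string")
  .csv("file:/tmp/file.csv")
{code}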
was:
I have found that depending on the name of the corrupt record in CSV, the field
is populated incorrectly. Here is an example:
{code:java}
// /tmp/file.csv contains a single line: 1,a
val df = spark.read
  .schema("c1 int, c2 string, x string, _corrupt_record string")
  .csv("file:/tmp/file.csv")
  .withColumn("x", lit("A"))

Returns:
+---+---+---+---------------+
|c1 |c2 |x  |_corrupt_record|
+---+---+---+---------------+
|1  |a  |A  |1,a            |
+---+---+---+---------------+{code}
> Column pruning is not handled correctly in CSV when _corrupt_record is used
> ---------------------------------------------------------------------------
>
> Key: SPARK-40468
> URL: https://issues.apache.org/jira/browse/SPARK-40468
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.0, 3.2.2, 3.4.0
> Reporter: Ivan Sadikov
> Priority: Major
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]