[ https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865679#comment-16865679 ]

Stuart White edited comment on SPARK-28058 at 6/17/19 3:08 PM:
---------------------------------------------------------------

Thank you both for your responses.
 
I now see that the [Spark SQL Upgrading Guide|https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html], in the [Upgrading From Spark SQL 2.3 to 2.4|https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-23-to-24] section, states:

{noformat}
In version 2.3 and earlier, CSV rows are considered as malformed if at least
one column value in the row is malformed. CSV parser dropped such rows in the
DROPMALFORMED mode or outputs an error in the FAILFAST mode. Since Spark 2.4,
CSV row is considered as malformed only when it contains malformed column
values requested from CSV datasource, other values can be ignored. As an
example, CSV file contains the “id,name” header and one row “1234”. In Spark
2.4, selection of the id column consists of a row with one column value 1234
but in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore
the previous behavior, set spark.sql.csv.parser.columnPruning.enabled to false.
{noformat}

I had not noticed that until you called the 
{{spark.sql.csv.parser.columnPruning.enabled}} option to my attention.
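
Purely to convince myself of the semantics, I wrote a tiny pure-Scala toy model of what the guide describes (hypothetical names like {{readCsv}}; this is NOT Spark's actual parser, just an illustration of the two modes for the subset-selection cases):

```scala
// Toy model (hypothetical; not Spark's actual parser) of DROPMALFORMED
// with and without CSV column pruning.
val header = Seq("fruit", "color", "price", "quantity")

def readCsv(lines: Seq[String], requested: Seq[String], pruning: Boolean): Seq[Seq[String]] = {
  lines
    .map(_.split(",", -1).toSeq)
    // pruning disabled (pre-2.4 behavior): the whole row must parse, so a
    // short row is malformed and dropped; pruning enabled (2.4 default): a
    // short row survives as long as the requested columns can be handed back.
    .filter(fields => pruning || fields.length == header.length)
    .map(fields => requested.map(c => fields.lift(header.indexOf(c)).orNull))
}

val rows = Seq("apple,red,1,3", "banana,yellow,2,4", "orange,orange,3,5", "xxx")

readCsv(rows, Seq("fruit"), pruning = true)   // "xxx" survives, as in my report
readCsv(rows, Seq("fruit"), pruning = false)  // "xxx" dropped, as in 2.3 and earlier
```

In a real session the switch is the documented config, e.g. {{spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false)}}, per the guide text quoted above.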

Thanks again for the help!


> Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
> -----------------------------------------------------------------------
>
>                 Key: SPARK-28058
>                 URL: https://issues.apache.org/jira/browse/SPARK-28058
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.1, 2.4.3
>            Reporter: Stuart White
>            Priority: Minor
>              Labels: CSV, csv, csvparser
>
> The Spark SQL CSV reader is not dropping malformed records as expected.
> Consider this file (fruit.csv).  Notice it contains a header record, 3 valid 
> records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
> If I read this file using the spark sql csv reader as follows, everything 
> looks good.  The malformed record is dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
> +------+------+-----+--------+
> |fruit |color |price|quantity|
> +------+------+-----+--------+
> |apple |red   |1    |3       |
> |banana|yellow|2    |4       |
> |orange|orange|3    |5       |
> +------+------+-----+--------+
> {noformat}
> However, if I select a subset of the columns, the malformed record is not 
> dropped.  The malformed data is placed in the first column, and the remaining 
> column(s) are filled with nulls.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
> +------+
> |fruit |
> +------+
> |apple |
> |banana|
> |orange|
> |xxx   |
> +------+
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
> +------+------+
> |fruit |color |
> +------+------+
> |apple |red   |
> |banana|yellow|
> |orange|orange|
> |xxx   |null  |
> +------+------+
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price).show(truncate=false)
> +------+------+-----+
> |fruit |color |price|
> +------+------+-----+
> |apple |red   |1    |
> |banana|yellow|2    |
> |orange|orange|3    |
> |xxx   |null  |null |
> +------+------+-----+
> {noformat}
> And finally, if I manually select all of the columns, the malformed record is 
> once again dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 'quantity).show(truncate=false)
> +------+------+-----+--------+
> |fruit |color |price|quantity|
> +------+------+-----+--------+
> |apple |red   |1    |3       |
> |banana|yellow|2    |4       |
> |orange|orange|3    |5       |
> +------+------+-----+--------+
> {noformat}
> I would expect the malformed record(s) to be dropped regardless of which 
> columns are being selected from the file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
