[ https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-28058:
------------------------------------

    Assignee:     (was: Apache Spark)

> Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
> -----------------------------------------------------------------------
>
>                 Key: SPARK-28058
>                 URL: https://issues.apache.org/jira/browse/SPARK-28058
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.1, 2.4.3
>            Reporter: Stuart White
>            Priority: Minor
>              Labels: CSV, csv, csvparser
>
> The spark sql csv reader is not dropping malformed records as expected.
> Consider this file (fruit.csv). Notice it contains a header record, 3 valid
> records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
> If I read this file using the spark sql csv reader as follows, everything
> looks good. The malformed record is dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
> +------+------+-----+--------+
> |fruit |color |price|quantity|
> +------+------+-----+--------+
> |apple |red   |1    |3       |
> |banana|yellow|2    |4       |
> |orange|orange|3    |5       |
> +------+------+-----+--------+
> {noformat}
> However, if I select a subset of the columns, the malformed record is not
> dropped. The malformed data is placed in the first column, and the remaining
> column(s) are filled with nulls.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
> +------+
> |fruit |
> +------+
> |apple |
> |banana|
> |orange|
> |xxx   |
> +------+
>
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
> +------+------+
> |fruit |color |
> +------+------+
> |apple |red   |
> |banana|yellow|
> |orange|orange|
> |xxx   |null  |
> +------+------+
>
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price).show(truncate=false)
> +------+------+-----+
> |fruit |color |price|
> +------+------+-----+
> |apple |red   |1    |
> |banana|yellow|2    |
> |orange|orange|3    |
> |xxx   |null  |null |
> +------+------+-----+
> {noformat}
> And finally, if I manually select all of the columns, the malformed record is
> once again dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 'quantity).show(truncate=false)
> +------+------+-----+--------+
> |fruit |color |price|quantity|
> +------+------+-----+--------+
> |apple |red   |1    |3       |
> |banana|yellow|2    |4       |
> |orange|orange|3    |5       |
> +------+------+-----+--------+
> {noformat}
> I would expect the malformed record(s) to be dropped regardless of which
> columns are being selected from the file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
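For reference, the DROPMALFORMED semantics the reporter expects (a record is malformed when its field count does not match the header, and such records are dropped regardless of which columns are later projected) can be illustrated with a minimal, Spark-free sketch. This is plain Python using the standard `csv` module, not the Spark API; the helper name `read_csv_dropmalformed` is invented for illustration.

```python
import csv
import io

def read_csv_dropmalformed(text):
    """Parse CSV text with a header row, dropping any record whose
    field count differs from the header's (DROPMALFORMED semantics).
    NOTE: illustrative helper, not part of any Spark API."""
    rows = list(csv.reader(io.StringIO(text)))
    header, records = rows[0], rows[1:]
    return header, [r for r in records if len(r) == len(header)]

# The fruit.csv contents from the bug report above.
data = """fruit,color,price,quantity
apple,red,1,3
banana,yellow,2,4
orange,orange,3,5
xxx
"""

header, records = read_csv_dropmalformed(data)
# The malformed record ["xxx"] is dropped here *before* any column
# projection, so selecting a subset of columns afterwards cannot
# resurrect it -- the behavior the report expects from Spark.
```

Under these semantics the surviving records are exactly the three well-formed fruit rows, whether the caller later keeps one column or all four.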