Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/21296#discussion_r187426203
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
---
@@ -267,7 +267,7 @@ class CSVSuite extends QueryTest with SharedSQLContext
with SQLTestUtils with Te
.options(Map("header" -> "true", "mode" -> "dropmalformed"))
.load(testFile(carsFile))
- assert(cars.select("year").collect().size === 2)
+ assert(cars.collect().size === 2)
--- End diff --
The `cars.csv` file has a header with 5 columns:
```
year,make,model,comment,blank
```
and 2 rows with 4 valid values each, where the last (`blank`) column is empty:
```
"2012","Tesla","S","No comment",
1997,Ford,E350,"Go get one now they are going fast",
```
and one more row with only 3 columns:
```
2015,Chevy,Volt
```
The previous (current) implementation drops the last row in `dropmalformed`
mode because it parses whole rows, and the last one is malformed. If only the
`year` column is selected, the uniVocity parser returns values only for the first
column (index `0`) and doesn't check the correctness of the rest of the row.
As a result, `cars.select("year").collect().size` returns `3`.
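The difference can be sketched as below. This is a hypothetical reproduction, not code from the PR; it assumes a running `SparkSession` named `spark`, a local path to the `cars.csv` file shown above, and ScalaTest's `===` matcher:

```scala
// Read cars.csv with dropmalformed mode, as in the test under discussion.
// "path/to/cars.csv" is a placeholder for the actual test resource path.
val cars = spark.read
  .format("csv")
  .options(Map("header" -> "true", "mode" -> "dropmalformed"))
  .load("path/to/cars.csv")

// Without column pruning, whole rows are parsed: the 3-column row
// (2015,Chevy,Volt) is detected as malformed and dropped.
assert(cars.collect().size === 2)

// With column pruning, uniVocity only parses the requested column
// (index 0), so the 3-column row is never flagged as malformed.
// Before the behavior change in this PR, this returned 3, not 2:
// cars.select("year").collect().size
```

This is why the assertion in the diff switches from `cars.select("year").collect().size` to `cars.collect().size`: only the latter still exercises full-row malformed-record detection.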
---