Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/21296#discussion_r187604963
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -267,7 +267,7 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
.options(Map("header" -> "true", "mode" -> "dropmalformed"))
.load(testFile(carsFile))
- assert(cars.select("year").collect().size === 2)
+ assert(cars.collect().size === 2)
--- End diff ---
> it's intentionally parsed to keep backward compatibility.
Right, by selecting all columns I force *UnivocityParser* to fall into the
case:
https://github.com/MaxGekk/spark-1/blob/a4a0a549156a15011c33c7877a35f244d75b7a4f/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala#L193-L213
where the number of returned tokens is less than required.
In the case of `cars.select("year")`, the uniVocity parser returns only one
token, as expected.
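For illustration, here is a minimal sketch of the two access patterns being
compared; the SparkSession `spark`, the file path, and the shape of the input
are assumptions for the sketch rather than the literal test fixture:
```
// Minimal sketch, assuming a SparkSession `spark` and a CSV file whose last
// row is missing trailing columns (similar in shape to the cars.csv fixture).
val cars = spark.read
  .format("csv")
  .options(Map("header" -> "true", "mode" -> "dropmalformed"))
  .load("/path/to/cars.csv")

// Pruned projection: only one token per row is required, so a short row can
// still satisfy the requested schema and is kept.
val prunedCount = cars.select("year").count()

// Full projection: every column is required; a short row yields fewer tokens
// than the schema needs, so DROPMALFORMED drops it.
val fullCount = cars.count()

// With such an input, prunedCount and fullCount can differ, which is exactly
// the discrepancy the assertion change above exposes.
```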
> There was an issue about the different number of counts.
The PR changes the behavior for some malformed inputs, but I believe we could
provide better performance for users whose inputs are correct.
> I think you are basically saying cars.select("year").collect().size and
cars.collect().size are different and they are correct, right?
Yes, you could say that. You are right that the PR proposes a different
interpretation of malformed rows. `cars.select("year")` is:
```
+----+
|year|
+----+
|2012|
|1997|
|2015|
+----+
```
and we should not reject `2015` just because there are problems in columns
that were not requested. In this particular case, the last row contains only
one value, at position `0`, and that value is correct.
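To make this concrete, the kind of input shape under discussion could look
like the following (a hypothetical file, not the literal test fixture):
```
year,make,model
2012,Tesla,S
1997,Ford,E350
2015
```
The last row supplies a valid token for `year` at position `0` but is missing
tokens for the remaining columns, so whether it counts as malformed depends on
which columns the query actually requires.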
---