GitHub user MaxGekk opened a pull request:
https://github.com/apache/spark/pull/21394
[SPARK-24329][SQL] Test for skipping multi-space lines
## What changes were proposed in this pull request?
This PR is a continuation of https://github.com/apache/spark/pull/21380 . It
checks the cases handled by this code:
https://github.com/apache/spark/blob/e3de6ab30d52890eb08578e55eb4a5d2b4e7aa35/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala#L303-L304
The code skips lines that are empty or contain only whitespace, as well as
lines that start with the comment prefix (see
[filterCommentAndEmpty](https://github.com/apache/spark/blob/e3de6ab30d52890eb08578e55eb4a5d2b4e7aa35/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala#L47))
```scala
iter.filter { line =>
  line.trim.nonEmpty && !line.startsWith(options.comment.toString)
}
```
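To make the predicate's behavior concrete, here is a minimal standalone sketch that applies the same filter to a handful of sample lines. `SketchOptions` is a hypothetical stand-in for Spark's `CSVOptions` that carries only the comment character, and `#` is an assumed comment character chosen for illustration; neither comes from the Spark code base.

```scala
// Hypothetical stand-in for CSVOptions: only the comment character is modeled.
final case class SketchOptions(comment: Char)

object FilterSketch {
  // Same predicate as the snippet above, applied to a plain Iterator[String].
  def filterCommentAndEmpty(
      iter: Iterator[String],
      options: SketchOptions): Iterator[String] = {
    iter.filter { line =>
      line.trim.nonEmpty && !line.startsWith(options.comment.toString)
    }
  }

  def main(args: Array[String]): Unit = {
    val lines = Iterator(
      "a,b",          // data line: kept
      "",             // empty line: dropped
      " ",            // single space: dropped
      "   \t ",       // several whitespace characters: dropped
      "# a comment",  // starts with the comment character: dropped
      "1,2",          // data line: kept
      "  # indented"  // leading spaces before '#': kept, see note below
    )
    // '#' is an assumed comment character for this illustration.
    val kept = filterCommentAndEmpty(lines, SketchOptions('#')).toList
    kept.foreach(println)
  }
}
```

In this sketch the comment check runs against the untrimmed line, so a line whose comment character is preceded by whitespace passes the filter. That follows from the predicate as quoted above; whether later parsing stages treat such a line differently is outside what this sketch shows.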
## How was this patch tested?
Added a test for the case described above.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MaxGekk/spark-1 test-for-multi-space-lines
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21394.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21394
----
commit b0f73e5f5dda5ec74c91dad07f50f9960402cc82
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-05-22T11:59:51Z
Test checks skipping lines with comments, and one or multiple whitespaces
----