GitHub user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/21296#discussion_r187499921
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala
---
@@ -73,11 +64,24 @@ class UnivocityParser(
// Each input token is placed in each output row's position by mapping these. In this case,
//
// output row - ["A", 2]
- private val valueConverters: Array[ValueConverter] =
- schema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray
+ private val valueConverters: Array[ValueConverter] = {
+ requiredSchema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray
+ }
- private val tokenIndexArr: Array[Int] = {
- requiredSchema.map(f => schema.indexOf(f)).toArray
+ private val tokenizer = {
+ val parserSetting = options.asParserSettings
+ if (requiredSchema.length < schema.length) {
+ val tokenIndexArr = requiredSchema.map(f => java.lang.Integer.valueOf(schema.indexOf(f)))
+ parserSetting.selectIndexes(tokenIndexArr: _*)
--- End diff --
FWIW, I think I tried this locally, but I didn't submit a PR because the
improvement was trivial and a test was broken.
---