bersprockets commented on a change in pull request #23336: [SPARK-26378][SQL]
Restore performance of queries against wide CSV tables
URL: https://github.com/apache/spark/pull/23336#discussion_r242384016
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala
##########
@@ -40,14 +40,20 @@ class FailureSafeParser[IN](
   // set the bad record to this field, and set other fields according to the partial result or null.
   private val toResultRow: (Option[InternalRow], () => UTF8String) => InternalRow = {
     (row, badRecord) => {
-      var i = 0
-      while (i < actualSchema.length) {
-        val from = actualSchema(i)
-        resultRow(schema.fieldIndex(from.name)) = row.map(_.get(i, from.dataType)).orNull
-        i += 1
+      // save the value. Some implementations of badRecord do not like to be called twice
+      val badRec = badRecord()
Review comment:
Actually, with the fix for [SPARK-26372](https://issues.apache.org/jira/browse/SPARK-26372), which fills in null for bad or missing fields, I wonder whether we need to rewrite the row at all for bad records when schema == actualSchema. There's probably an obvious reason we need to rewrite it that I've forgotten, but I will investigate whether it's needed (which could save some more time).
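To make the idea concrete, here is a minimal sketch (not Spark's actual `FailureSafeParser` internals; `Field` and `ResultBuilder` are hypothetical stand-ins, with `Array[Any]` playing the role of `InternalRow`) of the two points under discussion: caching the `badRecord()` result so the supplier is invoked only once, and short-circuiting the per-field copy when the two schemas match:

```scala
// Hypothetical illustration of the proposed short-circuit; names and types
// are simplified and do not match Spark's real FailureSafeParser.
case class Field(name: String, dataType: String)

class ResultBuilder(schema: Seq[Field], actualSchema: Seq[Field]) {

  def toResultRow(row: Option[Array[Any]], badRecord: () => String): Array[Any] = {
    // Save the value: some implementations of badRecord do not like
    // to be called twice, so call the supplier exactly once.
    val badRec = badRecord()

    if (schema == actualSchema) {
      // Schemas align, so the partial row (with nulls already filled in
      // for bad or missing fields, per SPARK-26372) could be reused
      // without copying it field by field.
      row.getOrElse(Array.fill[Any](schema.length)(null))
    } else {
      // Schemas differ: map each parsed field to its position in the
      // full schema, leaving unparsed fields null.
      val resultRow = new Array[Any](schema.length)
      var i = 0
      while (i < actualSchema.length) {
        val from = actualSchema(i)
        resultRow(schema.indexWhere(_.name == from.name)) = row.map(_(i)).orNull
        i += 1
      }
      resultRow
    }
  }
}
```

Whether the fast path is actually safe depends on the invariant being questioned above: that for bad records with matching schemas the partial row is already in its final shape.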
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]