Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/22880#discussion_r231249401
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala
---
@@ -202,11 +204,15 @@ private[parquet] class ParquetRowConverter(
override def start(): Unit = {
var i = 0
- while (i < currentRow.numFields) {
+ while (i < fieldConverters.length) {
fieldConverters(i).updater.start()
currentRow.setNullAt(i)
--- End diff --
> I'm going to push a new commit keeping the current code but with a brief
> explanatory comment.
On further careful consideration, I believe that separating the calls to
`currentRow.setNullAt(i)` into their own loop actually won't incur any
significant performance degradation, if any at all.
The performance of the `start()` method is dominated by the calls to
`fieldConverters(i).updater.start()` and `currentRow.setNullAt(i)`. Putting the
latter calls into their own loop won't change the count of those method calls,
just the order. @viirya LMK if you disagree with my analysis.
I will push a new commit with separate while loops. I won't use the more
elegant `(0 until currentRow.numFields).foreach(currentRow.setNullAt)` because
that's not a plain `while` loop, and I doubt either the Spark or HotSpot
optimizer can turn it into one.
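For illustration only, here is a minimal sketch of the separate-loop shape described above. `StubUpdater`, `StubRow`, and `SeparateLoopsSketch` are hypothetical stand-ins, not the real Spark classes, and this is not necessarily the code that ends up in the commit. The counters exist only to show that each method is still called the same number of times, just in a different order.

```scala
// Hypothetical stand-in for a field converter's updater; counts start() calls.
final class StubUpdater {
  var starts = 0
  def start(): Unit = starts += 1
}

// Hypothetical stand-in for the mutable row; counts setNullAt() calls.
final class StubRow(val numFields: Int) {
  private val nulls = Array.fill(numFields)(false)
  var setNullCalls = 0
  def setNullAt(i: Int): Unit = {
    nulls(i) = true
    setNullCalls += 1
  }
  def isNullAt(i: Int): Boolean = nulls(i)
}

object SeparateLoopsSketch {
  // Two plain while loops: one over the updaters, one over the row's fields.
  // The two bounds are independent, so the row may have more fields than
  // there are updaters.
  def start(row: StubRow, updaters: Array[StubUpdater]): Unit = {
    var i = 0
    while (i < updaters.length) {
      updaters(i).start()
      i += 1
    }
    i = 0
    while (i < row.numFields) {
      row.setNullAt(i)
      i += 1
    }
  }
}
```

With a row of three fields and two updaters, each updater is started once and every field in the row is set to null, including the field that has no updater.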
---