msamirkhan commented on a change in pull request #29353:
URL: https://github.com/apache/spark/pull/29353#discussion_r466144939
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala
##########
@@ -73,135 +75,157 @@ class OrcDeserializer(
* Creates a writer to write ORC values to Catalyst data structure at the given ordinal.
*/
private def newWriter(
- dataType: DataType, updater: CatalystDataUpdater): (Int, WritableComparable[_]) => Unit =
+ dataType: DataType, reuseObj: Boolean)
+ : (CatalystDataUpdater, Int, WritableComparable[_]) => Unit =
dataType match {
- case NullType => (ordinal, _) =>
+ case NullType => (updater, ordinal, _) =>
Review comment:
Only the top-level field writers are currently created once. At every level below, the writers are re-created for each data point. E.g., the struct case did the following:
`case st: StructType => (ordinal, value) =>`
`  ...`
`  val fieldConverters = st.map(_.dataType).map { dt => newWriter(dt, fieldUpdater) }.toArray`
`  ...`
i.e., it returns a function which, when called, creates the relevant fieldConverters. Instead, the field converters should be created outside the returned function so that they are created only once (see the sketch below).
Note that the updaters cannot always be reused, e.g., under an array or a map (i.e., whenever reuseObj becomes false).
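To illustrate the suggested restructuring, here is a minimal, self-contained Scala sketch of the pattern (the types `DataType`, `StructType`, `Updater`, `RowUpdater` and the overall shape are simplified stand-ins, not the actual Spark/ORC classes): the nested field writers are built once, when `newWriter` is called, the returned closure only applies them, and `reuseObj` decides whether a mutable row buffer may be shared across calls.

```scala
// Simplified model of the suggested restructuring (not the actual OrcDeserializer code).
object ConverterSketch {

  // Hypothetical stand-ins for Catalyst's DataType hierarchy and CatalystDataUpdater.
  sealed trait DataType
  case object IntType extends DataType
  case class StructType(fields: Seq[DataType]) extends DataType

  trait Updater { def set(ordinal: Int, value: Any): Unit }
  final class RowUpdater(val row: Array[Any]) extends Updater {
    def set(ordinal: Int, value: Any): Unit = row(ordinal) = value
  }

  // The writer takes the updater at call time, mirroring the new
  // (CatalystDataUpdater, Int, WritableComparable[_]) => Unit shape in the diff.
  type Writer = (Updater, Int, Any) => Unit

  def newWriter(dataType: DataType, reuseObj: Boolean): Writer = dataType match {
    case IntType => (updater, ordinal, value) =>
      updater.set(ordinal, value.asInstanceOf[Int])

    case StructType(fields) =>
      // Built once, when the writer is constructed, not on every call.
      val fieldWriters: Array[Writer] = fields.map(dt => newWriter(dt, reuseObj)).toArray

      if (reuseObj) {
        // One mutable row shared across calls; the consumer is expected to copy
        // it before the next value is written.
        val rowUpdater = new RowUpdater(new Array[Any](fields.length))
        (updater, ordinal, value) => {
          val elems = value.asInstanceOf[Seq[Any]]
          var i = 0
          while (i < fieldWriters.length) { fieldWriters(i)(rowUpdater, i, elems(i)); i += 1 }
          updater.set(ordinal, rowUpdater.row)
        }
      } else {
        // Under an array or map every element needs its own row, so allocate per
        // call; the fieldWriters themselves are still created only once.
        (updater, ordinal, value) => {
          val elems = value.asInstanceOf[Seq[Any]]
          val rowUpdater = new RowUpdater(new Array[Any](fields.length))
          var i = 0
          while (i < fieldWriters.length) { fieldWriters(i)(rowUpdater, i, elems(i)); i += 1 }
          updater.set(ordinal, rowUpdater.row)
        }
      }
  }
}
```

With this shape, the cost of building the converter tree is paid once per schema when the deserializer is constructed, rather than once per row (or per nested element) while reading.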