msamirkhan commented on a change in pull request #29353:
URL: https://github.com/apache/spark/pull/29353#discussion_r466144939



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala
##########
@@ -73,135 +75,157 @@ class OrcDeserializer(
    * Creates a writer to write ORC values to Catalyst data structure at the given ordinal.
    */
   private def newWriter(
-      dataType: DataType, updater: CatalystDataUpdater): (Int, WritableComparable[_]) => Unit =
+      dataType: DataType, reuseObj: Boolean)
+  : (CatalystDataUpdater, Int, WritableComparable[_]) => Unit =
     dataType match {
-      case NullType => (ordinal, _) =>
+      case NullType => (updater, ordinal, _) =>

Review comment:
Only the top-level field writers are currently created once; for every level below, they are re-created for each data point. For example, the struct case did the following:
   
```scala
case st: StructType => (ordinal, value) =>
  ...
  val fieldConverters = st.map(_.dataType).map { dt => newWriter(dt, fieldUpdater) }.toArray
  ...
```
   
That is, it returns a function which, when called, creates the field converters anew. Instead, the field converters should be created outside the returned function, so that they are built only once.
   
Note that the updaters are not always reused, e.g., under an array or a map (i.e., whenever `reuseObj` becomes false), which is why the updater is passed at call time rather than captured when the writer is created.
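
A minimal standalone sketch of the intended pattern (hypothetical stand-in types, not the PR's actual code): the per-field writers are built once, when the parent writer is constructed, and the updater is a call-time parameter, so a single writer instance can serve every row:

```scala
object WriterSketch {
  // Stand-ins for Catalyst's DataType hierarchy, for illustration only.
  sealed trait DataType
  case object IntType extends DataType
  final case class StructType(fields: Seq[DataType]) extends DataType

  // Stand-in for CatalystDataUpdater.
  trait Updater { def set(ordinal: Int, value: Any): Unit }

  // The updater is passed per call instead of being captured at construction,
  // so the same writer works even where updaters cannot be reused.
  type Writer = (Updater, Int, Any) => Unit

  def newWriter(dataType: DataType): Writer = dataType match {
    case IntType =>
      (updater, ordinal, value) => updater.set(ordinal, value)

    case StructType(fields) =>
      // Built once, outside the returned closure: recursing into nested
      // structs no longer re-creates converters for every data point.
      val fieldWriters: Array[Writer] = fields.map(newWriter).toArray
      (updater, ordinal, value) => {
        // Per-row work only. (The real code would write the fields into a
        // nested row through its own updater; flattened here for brevity.)
        val row = value.asInstanceOf[Seq[Any]]
        var i = 0
        while (i < fieldWriters.length) {
          fieldWriters(i)(updater, i, row(i))
          i += 1
        }
      }
  }

  def main(args: Array[String]): Unit = {
    val write = newWriter(StructType(Seq(IntType, IntType))) // converters built once here
    val out = scala.collection.mutable.ArrayBuffer[Any](null, null)
    val updater: Updater = (ordinal, v) => out(ordinal) = v
    write(updater, 0, Seq(1, 2)) // per-row call; no converter re-creation
    println(out)                 // ArrayBuffer(1, 2)
  }
}
```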



