revans2 commented on a change in pull request #29067:
URL: https://github.com/apache/spark/pull/29067#discussion_r463004283



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
##########
@@ -130,34 +87,29 @@ case class InMemoryTableScanExec(
     val numOutputRows = longMetric("numOutputRows")
     // Using these variables here to avoid serialization of entire objects (if referenced
     // directly) within the map Partitions closure.
-    val relOutput: AttributeSeq = relation.output
-
-    filteredCachedBatches().mapPartitionsInternal { cachedBatchIterator =>
-      // Find the ordinals and data types of the requested columns.
-      val (requestedColumnIndices, requestedColumnDataTypes) =
-        attributes.map { a =>
-          relOutput.indexOf(a.exprId) -> a.dataType
-        }.unzip
+    val relOutput = relation.output
+    val serializer = relation.cacheBuilder.serializer
 
-      // update SQL metrics
-      val withMetrics = cachedBatchIterator.map { batch =>
+    // update SQL metrics
+    val withMetrics =
+      filteredCachedBatches().map { batch =>
         if (enableAccumulatorsForTest) {
           readBatches.add(1)
         }
         numOutputRows += batch.numRows
         batch
       }
-
-      val columnTypes = requestedColumnDataTypes.map {
-        case udt: UserDefinedType[_] => udt.sqlType
-        case other => other
-      }.toArray
-      val columnarIterator = GenerateColumnAccessor.generate(columnTypes)
-      columnarIterator.initialize(withMetrics, columnTypes, requestedColumnIndices.toArray)
-      if (enableAccumulatorsForTest && columnarIterator.hasNext) {
-        readPartitions.add(1)
+    val rows = serializer.convertCachedBatchToInternalRow(withMetrics, relOutput, attributes, conf)
+    if (enableAccumulatorsForTest) {

Review comment:
       The original code has the `enableAccumulatorsForTest` check appear three times, just like this code does:
   
   1) Zero out the metrics
   
https://github.com/apache/spark/blob/743772095273b464f845efefb3eb59284b06b9be/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala#L125
   
   2) check if it should update `readBatches`
   
https://github.com/apache/spark/blob/743772095273b464f845efefb3eb59284b06b9be/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala#L144
   
   3) check if it should update `readPartitions`
   
https://github.com/apache/spark/blob/743772095273b464f845efefb3eb59284b06b9be/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala#L157
   
   I cannot get rid of those three checks.
   
   I did rearrange things so there is only one `mapPartitionsInternal` call (which is what I think you are really asking for), but to do it I had to move the `readPartitions` check from after deserialization to before it. The issue is that the high-level API we have for deserialization takes an `RDD` and returns an `RDD`, so I cannot do exactly what the original code did: the API it used took an `Iterator` and returned an `Iterator`.
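   To make the constraint concrete, here is a minimal, self-contained sketch in plain Scala (not the actual Spark types or this PR's API; every name below is illustrative):

```scala
// Sketch of the two API shapes. Strings stand in for rows, byte arrays for
// cached batches, and println stands in for the readPartitions accumulator.
object ApiShapeSketch {
  // Old shape: deserialization was Iterator => Iterator, so the
  // readPartitions check could run in the same closure, after conversion.
  def oldShape(batches: Iterator[Array[Byte]]): Iterator[String] = {
    val rows = batches.map(b => new String(b, "UTF-8")) // stand-in for deserialization
    if (rows.hasNext) println("readPartitions += 1")    // check after deserialization
    rows
  }

  // New shape: the converter takes a whole collection and returns one
  // (standing in for RDD => RDD), so callers cannot peek mid-stream...
  def convert(batches: Seq[Array[Byte]]): Seq[String] =
    batches.map(b => new String(b, "UTF-8"))

  // ...which forces the readPartitions check to move before the conversion.
  def newShape(batches: Seq[Array[Byte]]): Seq[String] = {
    if (batches.nonEmpty) println("readPartitions += 1")
    convert(batches)
  }

  def main(args: Array[String]): Unit = {
    val data = Seq("a", "b").map(_.getBytes("UTF-8"))
    oldShape(data.iterator).foreach(println)
    newShape(data).foreach(println)
  }
}
```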
   



