[GitHub] felixcheung commented on a change in pull request #23746: [SPARK-26761][SQL][R] Vectorized R gapply() implementation

GitBox Tue, 12 Feb 2019 00:43:23 -0800

felixcheung commented on a change in pull request #23746: [SPARK-26761][SQL][R] Vectorized R gapply() implementation
URL: https://github.com/apache/spark/pull/23746#discussion_r255845030


 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala
 ##########
 @@ -437,6 +440,71 @@ case class FlatMapGroupsInRExec(
   }
 }
 
+/**
+ * Similar with [[FlatMapGroupsInRExec]] but serializes and deserializes input/output in
+ * Arrow format.
+ * This is also somewhat similar with
+ * [[org.apache.spark.sql.execution.python.FlatMapGroupsInPandasExec]].
+ */
+case class FlatMapGroupsInRWithArrowExec(
+    func: Array[Byte],
+    packageNames: Array[Byte],
+    broadcastVars: Array[Broadcast[Object]],
+    inputSchema: StructType,
+    output: Seq[Attribute],
+    keyDeserializer: Expression,
+    groupingAttributes: Seq[Attribute],
+    child: SparkPlan) extends UnaryExecNode {
+  override def outputPartitioning: Partitioning = child.outputPartitioning
+
+  override def producedAttributes: AttributeSet = AttributeSet(output)
+
+  override def requiredChildDistribution: Seq[Distribution] =
+    if (groupingAttributes.isEmpty) {
+      AllTuples :: Nil
+    } else {
+      ClusteredDistribution(groupingAttributes) :: Nil
+    }
+
+  override def requiredChildOrdering: Seq[Seq[SortOrder]] =
+    Seq(groupingAttributes.map(SortOrder(_, Ascending)))
+
+  override protected def doExecute(): RDD[InternalRow] = {
+    child.execute().mapPartitionsInternal { iter =>
+      val grouped = GroupedIterator(iter, groupingAttributes, child.output)
+      val getKey = ObjectOperator.deserializeRowToObject(keyDeserializer, groupingAttributes)
+      val runner = new ArrowRRunner(
+        func, packageNames, broadcastVars, inputSchema, SQLConf.get.sessionLocalTimeZone)
+
+      val groupedByRKey = grouped.map { case (key, rowIter) =>
+        val newKey = rowToRBytes(getKey(key).asInstanceOf[Row])
+        (newKey, rowIter)
+      }
+
+      // The communication mechanism is as follows:
+      //
+      //    JVM side                           R side
+      //
+      // 1. Group internal rows
+      // 2. Grouped internal rows    --------> Arrow record natches
 
 Review comment:
   natches?

