HyukjinKwon commented on a change in pull request #23746: [SPARK-26761][SQL][R] Vectorized R gapply() implementation
URL: https://github.com/apache/spark/pull/23746#discussion_r255298462
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala
##########
@@ -437,6 +440,71 @@ case class FlatMapGroupsInRExec(
}
}
+/**
+ * Similar to [[FlatMapGroupsInRExec]] but serializes and deserializes input/output in
+ * Arrow format.
+ * This is also somewhat similar to
+ * [[org.apache.spark.sql.execution.python.FlatMapGroupsInPandasExec]].
+ */
+case class FlatMapGroupsInRWithArrowExec(
+ func: Array[Byte],
+ packageNames: Array[Byte],
+ broadcastVars: Array[Broadcast[Object]],
+ inputSchema: StructType,
+ output: Seq[Attribute],
+ keyDeserializer: Expression,
+ groupingAttributes: Seq[Attribute],
+ child: SparkPlan) extends UnaryExecNode {
+ override def outputPartitioning: Partitioning = child.outputPartitioning
+
+ override def producedAttributes: AttributeSet = AttributeSet(output)
+
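+ // All rows of a group must be co-located in a single partition; with no
+ // grouping attributes the whole input collapses into one partition (AllTuples).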
+ override def requiredChildDistribution: Seq[Distribution] =
+ if (groupingAttributes.isEmpty) {
+ AllTuples :: Nil
+ } else {
+ ClusteredDistribution(groupingAttributes) :: Nil
+ }
+
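+ // Sorting each partition by the grouping attributes lets GroupedIterator emit
+ // every group as a contiguous run of rows.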
+ override def requiredChildOrdering: Seq[Seq[SortOrder]] =
+ Seq(groupingAttributes.map(SortOrder(_, Ascending)))
+
+ override protected def doExecute(): RDD[InternalRow] = {
+ child.execute().mapPartitionsInternal { iter =>
+ val grouped = GroupedIterator(iter, groupingAttributes, child.output)
+ val getKey = ObjectOperator.deserializeRowToObject(keyDeserializer, groupingAttributes)
+ val runner = new ArrowRRunner(
+ func, packageNames, broadcastVars, inputSchema, SQLConf.get.sessionLocalTimeZone)
+
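+ // Each grouping key is first deserialized into an external Row, then serialized
+ // with SparkR's row format (rowToRBytes) so the R worker can read it natively;
+ // the group data itself is sent separately as Arrow record batches.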
+ val groupedByRKey = grouped.map { case (key, rowIter) =>
+ val newKey = rowToRBytes(getKey(key).asInstanceOf[Row])
+ (newKey, rowIter)
+ }
+
+ // The communication mechanism is as follows:
+ //
+ // JVM side R side
+ //
+ // 1. Group internal rows
+ // 2. Grouped internal rows --------> Arrow record batches
+ // 3. Grouped keys --------> Regular serialized keys
Review comment:
This protocol is different from the one used by Pandas's grouped map UDF. The
Pandas grouped map UDF sends the argument positions so the worker can locate
the grouping keys, whereas here I just decided to follow the existing protocol
in `gapply()`. Keys are relatively small, so this won't affect performance much
anyway.
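
To make the two protocols concrete, below is a minimal, self-contained Scala
sketch of the payload shapes described above. Everything in it
(`GroupedProtocolSketch`, `RGroupPayload`, `serializeKey`, `toArrowBatch`, and
so on) is a hypothetical stand-in for illustration, not a Spark API:

// Illustrative only: these types and helpers are hypothetical stand-ins,
// not Spark APIs. They contrast the two grouped-data protocols.
object GroupedProtocolSketch {
  // gapply()-style: each group carries an explicitly serialized key
  // alongside its Arrow-encoded rows.
  case class RGroupPayload(serializedKey: Array[Byte], arrowBatch: Array[Byte])

  // Pandas grouped map style: keys stay inside the Arrow data; only the
  // positions of the key columns are communicated out of band.
  case class PandasGroupPayload(keyColumnPositions: Seq[Int], arrowBatch: Array[Byte])

  def main(args: Array[String]): Unit = {
    // A hypothetical group: key "a" with two rows (key column at position 0).
    val rows = Seq(Seq("a", 1), Seq("a", 2))

    // Stand-ins for the real serializers (SparkR's rowToRBytes, Arrow writers).
    def serializeKey(k: String): Array[Byte] = k.getBytes("UTF-8")
    def toArrowBatch(rs: Seq[Seq[Any]]): Array[Byte] = rs.toString.getBytes("UTF-8")

    val rPayload = RGroupPayload(serializeKey("a"), toArrowBatch(rows))
    val pandasPayload = PandasGroupPayload(keyColumnPositions = Seq(0),
      arrowBatch = toArrowBatch(rows))

    println(s"gapply-style: ${rPayload.serializedKey.length} extra key bytes per group")
    println(s"Pandas-style: key positions ${pandasPayload.keyColumnPositions.mkString(",")}")
  }
}

The trade-off is visible in the two case classes: the gapply()-style payload
spends a few extra bytes per group on an explicit key, while the Pandas-style
payload keeps the keys inside the Arrow data and only communicates their
column positions.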