HyukjinKwon commented on a change in pull request #23746: [SPARK-26761][SQL][R] Vectorized R gapply() implementation
URL: https://github.com/apache/spark/pull/23746#discussion_r255298462
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala
##########
@@ -437,6 +440,71 @@ case class FlatMapGroupsInRExec(
}
}
+/**
+ * Similar to [[FlatMapGroupsInRExec]] but serializes and deserializes input/output in
+ * Arrow format.
+ * This is also somewhat similar to
+ * [[org.apache.spark.sql.execution.python.FlatMapGroupsInPandasExec]].
+ */
+case class FlatMapGroupsInRWithArrowExec(
+ func: Array[Byte],
+ packageNames: Array[Byte],
+ broadcastVars: Array[Broadcast[Object]],
+ inputSchema: StructType,
+ output: Seq[Attribute],
+ keyDeserializer: Expression,
+ groupingAttributes: Seq[Attribute],
+ child: SparkPlan) extends UnaryExecNode {
+ override def outputPartitioning: Partitioning = child.outputPartitioning
+
+ override def producedAttributes: AttributeSet = AttributeSet(output)
+
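+ // All rows of a group must be co-located in a single partition; with no
+ // grouping attributes the whole input collapses into one partition (AllTuples).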
+ override def requiredChildDistribution: Seq[Distribution] =
+ if (groupingAttributes.isEmpty) {
+ AllTuples :: Nil
+ } else {
+ ClusteredDistribution(groupingAttributes) :: Nil
+ }
+
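+ // Sorting each partition by the grouping attributes lets GroupedIterator emit
+ // every group as a contiguous run of rows.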
+ override def requiredChildOrdering: Seq[Seq[SortOrder]] =
+ Seq(groupingAttributes.map(SortOrder(_, Ascending)))
+
+ override protected def doExecute(): RDD[InternalRow] = {
+ child.execute().mapPartitionsInternal { iter =>
+ val grouped = GroupedIterator(iter, groupingAttributes, child.output)
+ val getKey = ObjectOperator.deserializeRowToObject(keyDeserializer, groupingAttributes)
+ val runner = new ArrowRRunner(
+ func, packageNames, broadcastVars, inputSchema, SQLConf.get.sessionLocalTimeZone)
+
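+ // Each grouping key is first deserialized into an external Row, then serialized
+ // with SparkR's row format (rowToRBytes) so the R worker can read it natively;
+ // the group data itself is sent separately as Arrow record batches.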
+ val groupedByRKey = grouped.map { case (key, rowIter) =>
+ val newKey = rowToRBytes(getKey(key).asInstanceOf[Row])
+ (newKey, rowIter)
+ }
+
+ // The communication mechanism is as follows:
+ //
+ // JVM side R side
+ //
+ // 1. Group internal rows
+ // 2. Grouped internal rows --------> Arrow record batches
+ // 3. Grouped keys --------> Regular serialized keys
Review comment:
This protocol is different from the one used by Pandas's grouped map UDF. The
Pandas grouped map UDF sends the argument positions so the worker can locate
the grouping keys, whereas here I just decided to follow the existing protocol
in `gapply()`. Keys are relatively small, so this won't affect performance much
anyway.
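
To make the two protocols concrete, below is a minimal, self-contained Scala
sketch of the payload shapes described above. Everything in it
(`GroupedProtocolSketch`, `RGroupPayload`, `serializeKey`, `toArrowBatch`, and
so on) is a hypothetical stand-in for illustration, not a Spark API:

// Illustrative only: these types and helpers are hypothetical stand-ins,
// not Spark APIs. They contrast the two grouped-data protocols.
object GroupedProtocolSketch {
  // gapply()-style: each group carries an explicitly serialized key
  // alongside its Arrow-encoded rows.
  case class RGroupPayload(serializedKey: Array[Byte], arrowBatch: Array[Byte])

  // Pandas grouped map style: keys stay inside the Arrow data; only the
  // positions of the key columns are communicated out of band.
  case class PandasGroupPayload(keyColumnPositions: Seq[Int], arrowBatch: Array[Byte])

  def main(args: Array[String]): Unit = {
    // A hypothetical group: key "a" with two rows (key column at position 0).
    val rows = Seq(Seq("a", 1), Seq("a", 2))

    // Stand-ins for the real serializers (SparkR's rowToRBytes, Arrow writers).
    def serializeKey(k: String): Array[Byte] = k.getBytes("UTF-8")
    def toArrowBatch(rs: Seq[Seq[Any]]): Array[Byte] = rs.toString.getBytes("UTF-8")

    val rPayload = RGroupPayload(serializeKey("a"), toArrowBatch(rows))
    val pandasPayload = PandasGroupPayload(keyColumnPositions = Seq(0),
      arrowBatch = toArrowBatch(rows))

    println(s"gapply-style: ${rPayload.serializedKey.length} extra key bytes per group")
    println(s"Pandas-style: key positions ${pandasPayload.keyColumnPositions.mkString(",")}")
  }
}

The trade-off is visible in the two case classes: the gapply()-style payload
spends a few extra bytes per group on an explicit key, while the Pandas-style
payload keeps the keys inside the Arrow data and only communicates their
column positions.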