[GitHub] [spark] vinodkc commented on a diff in pull request #41875: [SPARK-44317][SQL] Use PartitionEvaluator API in ShuffledHashJoinExec

via GitHub Thu, 13 Jul 2023 09:03:22 -0700


vinodkc commented on code in PR #41875:
URL: https://github.com/apache/spark/pull/41875#discussion_r1262768573



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala:
##########
@@ -766,4 +598,331 @@ object HashJoin extends CastSupport with SQLConfHelper {
         ansiEnabled = false)
     }
   }
+
+  private def streamedBoundKeys(streamedKeys: Seq[Expression], streamedOutput: Seq[Attribute]) =
+    bindReferences(HashJoin.rewriteKeyExpr(streamedKeys), streamedOutput)
+  private def streamSideKeyGenerator(
+      streamedKeys: Seq[Expression],
+      streamedOutput: Seq[Attribute]): UnsafeProjection =
+    UnsafeProjection.create(streamedBoundKeys(streamedKeys, streamedOutput))
+
+  def boundCondition(
+      condition: Option[Expression],
+      joinType: JoinType,
+      buildSide: BuildSide,
+      buildPlanOutput: Seq[Attribute],
+      streamedPlanOutput: Seq[Attribute]): InternalRow => Boolean = if (condition.isDefined) {
+    if (joinType == FullOuter && buildSide == BuildLeft) {
+      // Put join left side before right side. This is to be consistent with
+      // `ShuffledHashJoinExec.fullOuterJoin`.
+      Predicate.create(condition.get, buildPlanOutput ++ streamedPlanOutput).eval _
+    } else {
+      Predicate.create(condition.get, streamedPlanOutput ++ buildPlanOutput).eval _
+    }
+  } else { (r: InternalRow) =>
+    true
+  }
+
+  private def createResultProjection(
+      joinType: JoinType,
+      output: Seq[Attribute],
+      buildPlanOutput: Seq[Attribute],
+      streamedPlanOutput: Seq[Attribute]): (InternalRow) => InternalRow = {
+    joinType match {
+      case LeftExistence(_) =>
+        UnsafeProjection.create(output, output)
+      case _ =>
+        // Always put the stream side on left to simplify implementation
+        // both of left and right side could be null
+        UnsafeProjection.create(
+          output, (streamedPlanOutput ++ buildPlanOutput).map(_.withNullability(true)))
+    }
+  }
+  def join(hashJoinParams: HashJoinParams): Iterator[InternalRow] = {
+
+    val streamedIter: Iterator[InternalRow] = hashJoinParams.streamedIter
+    val hashed: HashedRelation = hashJoinParams.hashedRelation
+    val streamedKeys: Seq[Expression] = hashJoinParams.streamedKeys
+    val streamedOutput: Seq[Attribute] = hashJoinParams.streamedOutput
+    val condition: Option[Expression] = hashJoinParams.condition
+    val joinType: JoinType = hashJoinParams.joinType
+    val buildSide: BuildSide = hashJoinParams.buildSide
+    val buildPlanOutput: Seq[Attribute] = hashJoinParams.buildPlanOutput
+    val streamedPlanOutput: Seq[Attribute] = hashJoinParams.streamedPlanOutput
+    val output: Seq[Attribute] = hashJoinParams.output
+    val numOutputRows: SQLMetric = hashJoinParams.numOutputRows
+
+    val joinedIter = joinType match {
+      case _: InnerLike =>
+        innerJoin(
+          streamedIter,
+          hashed,
+          streamedKeys,
+          streamedOutput,
+          condition,
+          joinType,
+          buildSide,
+          buildPlanOutput,
+          streamedPlanOutput)

Review Comment:
   To avoid passing a lambda to the `join()` method, I refactored the join
methods and moved them into `object HashJoin....`, so the new object methods
(e.g. `innerJoin`) have to be called with additional parameters. Because of
that, I have not found a way to keep their original signatures.
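
The trade-off described above can be illustrated with a minimal sketch. It does not use Spark's real classes: `Row`, `InstanceJoin`, `StatelessJoin`, `lookup` and `streamedKeyIndex` are hypothetical names invented for this example. It only shows why a method that used to read state from its enclosing instance needs that state passed explicitly once it lives in an object.

```scala
// Simplified stand-ins for illustration only; these are not Spark's real types.
final case class Row(values: Seq[Any])

// Before: the join is an instance method and reads operator state implicitly.
trait InstanceJoin {
  def lookup: Map[Any, Seq[Row]] // hypothetical build-side hash table
  def streamedKeyIndex: Int      // hypothetical position of the join key

  def innerJoin(streamedIter: Iterator[Row]): Iterator[Row] =
    streamedIter.flatMap { s =>
      val matches = lookup.getOrElse(s.values(streamedKeyIndex), Seq.empty)
      matches.map(b => Row(s.values ++ b.values))
    }
}

// After: the join lives in an object with no instance state, so every
// input that used to be a field becomes an explicit parameter.
object StatelessJoin {
  def innerJoin(
      streamedIter: Iterator[Row],
      lookup: Map[Any, Seq[Row]],
      streamedKeyIndex: Int): Iterator[Row] =
    streamedIter.flatMap { s =>
      val matches = lookup.getOrElse(s.values(streamedKeyIndex), Seq.empty)
      matches.map(b => Row(s.values ++ b.values))
    }
}
```

Both versions produce the same rows; only the way the inputs reach the method differs, which is why the signature grows after the move.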




