[GitHub] spark pull request #19271: [SPARK-22053][SS] Stream-stream inner join in App...

brkyvz Tue, 19 Sep 2017 14:03:10 -0700

Github user brkyvz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19271#discussion_r139804087
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala ---
    @@ -0,0 +1,330 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.streaming
    +
    +import java.util.concurrent.TimeUnit.NANOSECONDS
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, JoinedRow, NamedExpression, UnsafeProjection, UnsafeRow}
    +import org.apache.spark.sql.catalyst.plans._
    +import org.apache.spark.sql.catalyst.plans.logical.EventTimeWatermark._
    +import org.apache.spark.sql.catalyst.plans.physical._
    +import org.apache.spark.sql.execution.{BinaryExecNode, SparkPlan}
    +import org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExecHelper._
    +import org.apache.spark.sql.execution.streaming.state._
    +import org.apache.spark.sql.internal.SessionState
    +import org.apache.spark.util.{CompletionIterator, SerializableConfiguration}
    +
    +
    +/**
    + * Performs stream-stream join using symmetric hash join algorithm. It works as follows.
    + *
    + *                             /-----------------------\
    + *   left side input --------->|    left side state    |------\
    + *                             \-----------------------/      |
    + *                                                            |--------> joined output
    + *                             /-----------------------\      |
    + *   right side input -------->|    right side state   |------/
    + *                             \-----------------------/
    + *
    + * Each join side buffers past input rows as streaming state so that the past input can be joined
    + * with future input on the other side. This buffer state is effectively a multi-map:
    + *    equi-join key -> list of past input rows received with the join key
    + *
    + * For each input row in each side, the following operations take place.
    + * - Calculate join key from the row.
    + * - Use the join key to append the row to the buffer state of the side that the row came from.
    + * - Find past buffered values for the key from the other side. For each such value, emit the
    + *   "joined row" (left-row, right-row)
    + * - Apply the optional condition to filter the joined rows as the final output.
    + *
    + * If a timestamp column with event time watermark is present in the join keys or in the input
    + * data, then the it uses the watermark figure out which rows in the buffer will not join with
    + * and new data, and therefore can be discarded. Depending on the provided query conditions, we
    --- End diff --
    
    nit: `with and new data` should read `with the new data`
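
For readers following the doc comment in the diff above, the per-row steps it lists (compute the key, append to this side's buffer, probe the other side's buffer, filter by the optional condition) can be sketched in plain Scala roughly as below. This is only an illustrative in-memory sketch, not the code in this PR: `SideState`, `SymmetricHashJoin`, `onLeft` and `onRight` are hypothetical names, the buffers are ordinary mutable maps rather than Spark state stores, and the watermark-based eviction of old rows is left out entirely.

```scala
import scala.collection.mutable

// Hypothetical stand-in for one side's buffered state: a multi-map from
// equi-join key to the rows received so far with that key.
final class SideState[K, R] {
  private val buffer = mutable.Map.empty[K, mutable.ArrayBuffer[R]]

  // Append the row to the list of rows buffered under its key.
  def append(key: K, row: R): Unit =
    buffer.getOrElseUpdate(key, mutable.ArrayBuffer.empty[R]) += row

  // Return an immutable snapshot of the rows buffered under the key.
  def get(key: K): Seq[R] =
    buffer.get(key).map(_.toList).getOrElse(Nil)
}

// Hypothetical in-memory symmetric hash join over two row streams.
final class SymmetricHashJoin[K, L, R](
    leftKey: L => K,
    rightKey: R => K,
    condition: (L, R) => Boolean) {

  private val leftState = new SideState[K, L]
  private val rightState = new SideState[K, R]

  // Handle one left-side row: buffer it, then join it against the
  // right-side rows buffered under the same key.
  def onLeft(row: L): Seq[(L, R)] = {
    val key = leftKey(row)
    leftState.append(key, row)
    rightState.get(key)
      .map(r => (row, r))
      .filter { case (l, r) => condition(l, r) }
  }

  // Handle one right-side row: the mirror image of onLeft.
  def onRight(row: R): Seq[(L, R)] = {
    val key = rightKey(row)
    rightState.append(key, row)
    leftState.get(key)
      .map(l => (l, row))
      .filter { case (l, r) => condition(l, r) }
  }
}
```

With rows modelled as tuples keyed on their first field, `onLeft((1, "a"))` emits nothing while the right buffer is empty, and a later `onRight((1, 2.5))` emits the joined pair because the left row was buffered. Since nothing is ever evicted here, both buffers grow without bound, which is the problem the watermark handling described at the end of the doc comment is meant to address.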


