David Hodeffi created SPARK-18944:
-------------------------------------

             Summary: Understanding BroadcastNestedLoopJoin and number of 
partitions
                 Key: SPARK-18944
                 URL: https://issues.apache.org/jira/browse/SPARK-18944
             Project: Spark
          Issue Type: Question
          Components: SQL
    Affects Versions: 2.0.2, 1.6.2
         Environment: Spark 1.6.2
            Reporter: David Hodeffi
            Priority: Trivial


I am joining two DataFrames, one small and one big. The optimizer chooses
BroadcastNestedLoopJoin.
The big DataFrame has 200 partitions, while the small DataFrame has 5
partitions.
The joined DataFrame ends up with 205 partitions (joined.rdd.partitions.size).
I tried to understand where this number comes from and found that
BroadcastNestedLoopJoin actually ends with a union.

Code:
case class BroadcastNestedLoopJoin(...) {
  def doExecute() = {
    ...
    ...
    sparkContext.union(
      matchedStreamRows,
      sparkContext.makeRDD(notMatchedBroadcastRows)
    )
  }
}
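
To illustrate why a union could produce 205 partitions, here is a minimal
sketch, separate from the Spark source. It assumes a local SparkContext and
two stand-in RDDs; the object name UnionPartitionsSketch, the helper
unionPartitionCount, and the sample data are hypothetical. The 5 slices for
the second RDD are also an assumption: the makeRDD call quoted above passes
no slice count, so in the real plan that number would come from the default
parallelism.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object UnionPartitionsSketch {

  // Hypothetical helper: builds two RDDs with the given partition counts,
  // unions them, and returns the partition count of the result.
  def unionPartitionCount(sc: SparkContext, bigSlices: Int, smallSlices: Int): Int = {
    // Stand-in for matchedStreamRows (the streamed, big side).
    val streamed = sc.parallelize(1 to 1000, bigSlices)
    // Stand-in for makeRDD(notMatchedBroadcastRows); the slice count is assumed.
    val unmatched = sc.makeRDD(Seq(-1, -2, -3), smallSlices)
    // Neither RDD has a partitioner, so the union keeps every input partition.
    sc.union(streamed, unmatched).partitions.length
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("union-partitions-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(unionPartitionCount(sc, 200, 5)) // prints 205
    } finally {
      sc.stop()
    }
  }
}
```

Under these assumptions the partition count of the union is the sum of the
input partition counts, which would be consistent with 200 + 5 = 205.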


Can someone explain what exactly the code of doExecute() does? Can you
elaborate on all the null checks and why we can have nulls? Why do we end up
with 205 partitions? A link to a JIRA discussion that explains the code would
help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

