David Hodeffi created SPARK-18944:
-------------------------------------
Summary: Understanding BroadcastNestedLoopJoin and number of
partitions
Key: SPARK-18944
URL: https://issues.apache.org/jira/browse/SPARK-18944
Project: Spark
Issue Type: Question
Components: SQL
Affects Versions: 2.0.2, 1.6.2
Environment: Spark 1.6.2
Reporter: David Hodeffi
Priority: Trivial
I have two dataframes which I am joining. small and big size dataframess. The
optimizer suggest to use BroadcastNestedLoopJoin.
number of partitions for the big dataframe is 200 while small dataframe has 5
partitions.
The joined dataframe results with 205 partitions (joined.rdd.partitions.size),
I have tried to understand why is this number and figured out that
BroadCastNestedLoopJoin is actually a union.
code :
case class BroadcastNestedLoopJoin{
def doExecuteo(): ={
...
...
sparkContext.union(
matchedStreamRows,
sparkContext.makeRDD(notMatchedBroadcastRows)
)
}
}
can someone explain what exactly the code of doExecute() do? can you elaborate
about all the null checks and why can we have nulls ? Why do we have 205
partions? link to a JIRA with discussion that can explain the code can help.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]