Lior Chaga created SPARK-21795:
----------------------------------
Summary: Broadcast hint ignored when dataframe is cached
Key: SPARK-21795
URL: https://issues.apache.org/jira/browse/SPARK-21795
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.2.0
Reporter: Lior Chaga
Priority: Minor
Not sure if it's a bug or by design, but if a DF is cached, the broadcast hint
is ignored and Spark uses SortMergeJoin instead.
{{code}}
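import org.apache.spark.sql.functions.broadcast

val largeDf = ...
val smallDf = ...
// cache() returns the same Dataset, so bind it to a new val rather than reassigning
val cachedSmallDf = smallDf.cache()
largeDf.join(broadcast(cachedSmallDf))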
{{code}}
It makes sense that there's no need to cache a DF that is only used in a
broadcast join. However, I wonder whether it's correct for Spark to ignore the
broadcast hint just because the DF is cached. Consider a case where a DF should
be cached for several queries, and in some of those queries it should be broadcast.
If this is the correct behavior, it's at least worth documenting that a cached DF
cannot be broadcast.
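To make this easier to reproduce, here is a rough sketch of how the chosen join strategy could be compared before and after caching. The object name `BroadcastHintCheck`, the helper `joinStrategy`, the row counts, and the local master are all made up for illustration. Setting `spark.sql.autoBroadcastJoinThreshold` to `-1` is my assumption for isolating the hint, so that size-based auto broadcast cannot mask the result. I am not stating what it prints, since that is exactly the question and it may differ between Spark versions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object BroadcastHintCheck {
  // Hypothetical helper: classify the join by searching the physical plan text.
  def joinStrategy(df: DataFrame): String = {
    val plan = df.queryExecution.executedPlan.toString
    if (plan.contains("BroadcastHashJoin")) "BroadcastHashJoin"
    else if (plan.contains("SortMergeJoin")) "SortMergeJoin"
    else "other"
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-hint-check")
      .master("local[*]") // assumption: local run, not the reporter's cluster
      .getOrCreate()

    // Assumption: turn off size-based auto broadcast so only the explicit hint matters.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    // Illustrative data only; the original report does not show how the DFs are built.
    val largeDf = spark.range(0, 1000000).toDF("id")
    val smallDf = spark.range(0, 100).toDF("id")

    // Plan the hinted join before anything is cached.
    val beforeCache = joinStrategy(largeDf.join(broadcast(smallDf), "id"))

    // Mark the small DF as cached, then plan the same hinted join again.
    smallDf.cache()
    val afterCache = joinStrategy(largeDf.join(broadcast(smallDf), "id"))

    println(s"hinted join, small DF not cached: $beforeCache")
    println(s"hinted join, small DF cached:     $afterCache")

    spark.stop()
  }
}
```

If the two printed strategies differ, the hint is being dropped once the DF is cached. Calling `explain()` on either join shows the full physical plan if more detail is needed.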