Lior Chaga created SPARK-21795:
----------------------------------

             Summary: Broadcast hint ignored when dataframe is cached
                 Key: SPARK-21795
                 URL: https://issues.apache.org/jira/browse/SPARK-21795
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Lior Chaga
            Priority: Minor


Not sure whether this is a bug or by design, but if a DF is cached, the
broadcast hint is ignored and Spark uses SortMergeJoin.

{{code}}
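import org.apache.spark.sql.functions.broadcast

val largeDf = ...
// var rather than val, because smallDf is reassigned below
var smallDf = ...
smallDf = smallDf.cache

// The broadcast hint on the cached DF is ignored here
largeDf.join(broadcast(smallDf))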

{{code}}

It makes sense that there's no need to cache a DF when using a broadcast join.
However, I wonder whether it is correct for Spark to ignore the broadcast hint
just because the DF is cached. Consider a case where a DF is cached for use in
several queries, and some of those queries should broadcast it.

If this is the correct behavior, it is at least worth documenting that a cached
DF cannot be broadcast.
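
For anyone trying to reproduce this, one way to see which join strategy was
chosen is to inspect the physical plan. The following is only a minimal sketch:
the local SparkSession, the synthetic DataFrames built with spark.range, the
"id" join column, and the object name BroadcastHintCheck are all assumptions
for illustration and are not part of the original report. Automatic
size-based broadcasting is disabled so that only the explicit hint could
trigger a broadcast join.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Hypothetical reproduction harness; names and data are illustrative only.
object BroadcastHintCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-hint-check")
      .getOrCreate()

    // Turn off size-based automatic broadcasting, so a broadcast join
    // can only come from the explicit hint.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    // Synthetic stand-ins for the DataFrames in the report.
    val largeDf = spark.range(0L, 1000000L).toDF("id")
    val smallDf = spark.range(0L, 100L).toDF("id").cache()

    // Join on an explicit key, with the broadcast hint on the cached side.
    val joined = largeDf.join(broadcast(smallDf), "id")

    // Print the physical plan for manual inspection.
    joined.explain()

    // Report whether the plan text mentions a broadcast hash join.
    val plan = joined.queryExecution.executedPlan.toString
    println(s"uses BroadcastHashJoin: ${plan.contains("BroadcastHashJoin")}")

    spark.stop()
  }
}
```

Running the same check with and without the .cache() call should show whether
caching is what changes the chosen join strategy on a given Spark version.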



