Sean Owen resolved SPARK-21795.
    Resolution: Duplicate

> Broadcast hint ignored when dataframe is cached
> -----------------------------------------------
>                 Key: SPARK-21795
>                 URL: https://issues.apache.org/jira/browse/SPARK-21795
>             Project: Spark
>          Issue Type: Question
>          Components: Documentation, SQL
>    Affects Versions: 2.2.0
>            Reporter: Lior Chaga
>            Priority: Minor
> Not sure if it's a bug or by design, but if a DF is cached, the broadcast 
> hint is ignored and Spark uses SortMergeJoin instead.
> {code}
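> import org.apache.spark.sql.functions.broadcast
> val largeDf = ...
> val smallDf = ...
> // a val cannot be reassigned, so bind the cached DF to a new name
> val cachedSmallDf = smallDf.cache()
> largeDf.join(broadcast(cachedSmallDf))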
> {code}
> It makes sense that there is no need to cache a DF when using a broadcast 
> join; however, I wonder whether it is correct for Spark to ignore the 
> broadcast hint just because the DF is cached. Consider a case where a DF 
> should be cached for several queries, and in some of those queries it 
> should be broadcast.
> If this is the correct behavior, it is at least worth documenting that a 
> cached DF cannot be broadcast.
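> A minimal sketch for checking which join strategy is actually chosen, 
> assuming a local SparkSession and synthetic data from spark.range. The 
> object name, row counts, and the "id" column are illustrative and not taken 
> from the real job; the result may differ between Spark versions.
> {code}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.broadcast
>
> // Hypothetical reproduction harness; names and sizes are assumptions.
> object BroadcastHintCheck {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder()
>       .master("local[*]")
>       .appName("broadcast-hint-check")
>       .getOrCreate()
>
>     // Disable size-based automatic broadcasting so that only the
>     // explicit hint can produce a broadcast join.
>     spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
>
>     val largeDf = spark.range(0L, 1000000L).toDF("id")
>     val smallDf = spark.range(0L, 100L).toDF("id").cache()
>     smallDf.count() // run an action so the cache is materialized
>
>     val joined = largeDf.join(broadcast(smallDf), "id")
>
>     // Print the physical plan and look for the join operator by name.
>     joined.explain()
>     val planText = joined.queryExecution.executedPlan.toString
>     println(s"BroadcastHashJoin in plan: ${planText.contains("BroadcastHashJoin")}")
>
>     spark.stop()
>   }
> }
> {code}
> Running the same check with and without the cache() call shows whether 
> caching is what changes the chosen join.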
