[ 
https://issues.apache.org/jira/browse/SPARK-21795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21795:
------------------------------
    Component/s: Documentation
     Issue Type: Question  (was: Bug)

I don't quite get the issue: you say it makes sense to not broadcast join when 
the DF is cached, but then say you might want to broadcast in some cases. 
What's the case where you want to?

> Broadcast hint ignored when dataframe is cached
> -----------------------------------------------
>
>                 Key: SPARK-21795
>                 URL: https://issues.apache.org/jira/browse/SPARK-21795
>             Project: Spark
>          Issue Type: Question
>          Components: Documentation, SQL
>    Affects Versions: 2.2.0
>            Reporter: Lior Chaga
>            Priority: Minor
>
> Not sure if it's a bug or by design, but if a DF is cached, the broadcast
> hint is ignored, and Spark uses SortMergeJoin.
> {code}
> val largeDf = ...
> val smallDf = ...
> smallDf.cache()
> largeDf.join(broadcast(smallDf))
> {code}
> It makes sense that there's no need to cache when using a broadcast join;
> however, I wonder whether it's correct for Spark to ignore the broadcast
> hint just because the DF is cached. Consider a case where a DF is cached
> for several queries, and some of those queries should broadcast it.
> If this is the correct behavior, it's at least worth documenting that a
> cached DF cannot be broadcast.



