[ 
https://issues.apache.org/jira/browse/SPARK-24426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24426.
----------------------------------
    Resolution: Incomplete

> Unexpected combination of cache and join on DataFrame
> -----------------------------------------------------
>
>                 Key: SPARK-24426
>                 URL: https://issues.apache.org/jira/browse/SPARK-24426
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.0
>            Reporter: Krzysztof Skulski
>            Priority: Major
>              Labels: bulk-closed
>
> I get unexpected results when I cache a DataFrame and then run further 
> groupings on it. New DataFrames built from the cached DataFrame with groupBy 
> work correctly, but when I join them to another DataFrame, the second join 
> seems to add a new column whose data is copied from the first joined 
> DataFrame. Example below (userAgentType is OK, userChannelType is OK, 
> userOrigin is not OK). 
> When I remove the cache from the aggregated DataFrame, it works correctly.
>  
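> {code:scala}
> import org.apache.spark.sql.functions.{asc, desc, first}
>
> // Cache the source DataFrame; the problem goes away when .cache() is removed.
> val aggregated = dataFrame.cache()
> // Per id: sort the (id, agentType) counts descending and take the first agentType.
> val userAgentType = aggregated.groupBy("id", "agentType").count()
>   .orderBy(asc("id"), desc("count"))
>   .groupBy("id").agg(first("agentType").as("agentType"))
> // Same aggregation for channelType.
> val userChannelType = aggregated.groupBy("id", "channelType").count()
>   .orderBy(asc("id"), desc("count"))
>   .groupBy("id").agg(first("channelType").as("channelType"))
> // The column added by the second join repeats the data from the first join.
> val userOrigin = userInfo
>   .join(userAgentType, Seq("id"), "left")
>   .join(userChannelType, Seq("id"), "left")
> {code}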
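
The snippet in the description does not define dataFrame or userInfo, so it cannot be run as posted. A minimal self-contained sketch of the same pipeline might look like the following. The sample rows, the "name" column in userInfo, the object name Spark24426Repro, the useCache switch, and the local master are all assumptions made for illustration and do not come from the report.

{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{asc, desc, first}

object Spark24426Repro {
  // Builds userOrigin as in the report; useCache toggles the .cache() call.
  def build(spark: SparkSession, useCache: Boolean): DataFrame = {
    import spark.implicits._

    // Hypothetical event data: (id, agentType, channelType).
    val dataFrame = Seq(
      (1, "mobile", "email"),
      (1, "mobile", "push"),
      (1, "desktop", "email"),
      (2, "desktop", "push"),
      (2, "desktop", "push")
    ).toDF("id", "agentType", "channelType")

    // Hypothetical user table; only the "id" column is implied by the report.
    val userInfo = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

    val aggregated = if (useCache) dataFrame.cache() else dataFrame

    val userAgentType = aggregated.groupBy("id", "agentType").count()
      .orderBy(asc("id"), desc("count"))
      .groupBy("id").agg(first("agentType").as("agentType"))

    val userChannelType = aggregated.groupBy("id", "channelType").count()
      .orderBy(asc("id"), desc("count"))
      .groupBy("id").agg(first("channelType").as("channelType"))

    userInfo
      .join(userAgentType, Seq("id"), "left")
      .join(userChannelType, Seq("id"), "left")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-24426-repro")
      .getOrCreate()

    // Print both variants so the channelType column can be compared by eye.
    build(spark, useCache = true).show()
    build(spark, useCache = false).show()

    spark.stop()
  }
}
{code}

Comparing the two printed tables on an affected version (2.3.0 per the report) would show whether the channelType column differs when the cache is enabled. Whether this small data set triggers the behavior is not known.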


