[ https://issues.apache.org/jira/browse/SPARK-24426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-24426.
----------------------------------
Resolution: Incomplete
> Unexpected combination of cache and join on DataFrame
> -----------------------------------------------------
>
> Key: SPARK-24426
> URL: https://issues.apache.org/jira/browse/SPARK-24426
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: Krzysztof Skulski
> Priority: Major
> Labels: bulk-closed
>
> I get unexpected results when I cache a DataFrame and then run more than
> one grouping on it. The new DataFrames built with groupBy on the cached
> DataFrame look correct on their own, but when I join them to another
> DataFrame, the second join appears to add a new column whose data is copied
> from the first joined DataFrame. In the example below, userAgentType and
> userChannelType are correct, but userOrigin is not.
> When I remove the cache from the aggregated DataFrame, the results are correct.
>
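> {code:scala}
> import org.apache.spark.sql.functions.{asc, desc, first}
>
> // dataFrame and userInfo are DataFrames defined elsewhere (not shown in this report).
> val aggregated = dataFrame.cache()
>
> // Intended: the most frequent agentType for each id.
> val userAgentType = aggregated.groupBy("id", "agentType").count()
>   .orderBy(asc("id"), desc("count"))
>   .groupBy("id").agg(first("agentType").as("agentType"))
>
> // Intended: the most frequent channelType for each id.
> val userChannelType = aggregated.groupBy("id", "channelType").count()
>   .orderBy(asc("id"), desc("count"))
>   .groupBy("id").agg(first("channelType").as("channelType"))
>
> // Reported as incorrect: the column added by the second join repeats the
> // data from the first joined DataFrame.
> val userOrigin = userInfo
>   .join(userAgentType, Seq("id"), "left")
>   .join(userChannelType, Seq("id"), "left")
> {code}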