[
https://issues.apache.org/jira/browse/SPARK-17662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15671401#comment-15671401
]
Herman van Hovell commented on SPARK-17662:
-------------------------------------------
This is more of a question for the user list or Stack Overflow, so I am closing
this issue.
BTW: I would use max, for example:
{noformat}
select user_id,
       action_type,
       max(struct(date, *)) last_record
from tbl
group by 1, 2
{noformat}
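The query above relies on structs being compared field by field, so putting the date first makes max pick the most recent row in each group. A minimal plain-Python sketch of the same idea, using tuples in place of structs; the Event type and last_records helper are hypothetical names for illustration, and the rows assume that "fav category" is the action_type in the sample data from the issue:

```python
from typing import NamedTuple


class Event(NamedTuple):
    user_id: int
    action_type: str
    value: int
    date: str  # ISO-8601 dates sort correctly as plain strings


# Sample rows from the issue (column split is an assumption).
events = [
    Event(123, "fav category", 1, "2016-02-01"),
    Event(123, "fav category", 4, "2016-02-02"),
    Event(123, "fav category", 8, "2016-02-03"),
    Event(123, "fav category", 2, "2016-02-04"),
]


def last_records(rows):
    """Keep, per (user_id, action_type), the row with the greatest date."""
    best = {}
    for row in rows:
        key = (row.user_id, row.action_type)
        # Date goes first, mirroring struct(date, *): tuples compare
        # left to right, so the date decides and the row breaks ties.
        candidate = (row.date, row)
        if key not in best or candidate > best[key]:
            best[key] = candidate
    return {key: row for key, (_, row) in best.items()}


result = last_records(events)
```

Here result maps each (user_id, action_type) pair to the full row carrying the latest date, which is what last_record holds in the SQL version.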
> Dedup UDAF
> ----------
>
> Key: SPARK-17662
> URL: https://issues.apache.org/jira/browse/SPARK-17662
> Project: Spark
> Issue Type: New Feature
> Reporter: Ohad Raviv
>
> We have a common use case of deduping a table by creation order.
> For example, we have an event log of user actions. A user marks his favorite
> category from time to time.
> In our analytics we would like to know only the user's last favorite category.
> The data:
> user_id  action_type   value  date
> 123      fav category  1      2016-02-01
> 123      fav category  4      2016-02-02
> 123      fav category  8      2016-02-03
> 123      fav category  2      2016-02-04
> We would like to get only the last update according to the date column.
> We could of course do it in SQL:
> select * from (
>   select *,
>          row_number() over (partition by user_id, action_type order by date desc) as rnum
>   from tbl)
> where rnum = 1;
> But then, I believe it can't be optimized on the mapper side, so all the
> data gets shuffled to the reducers instead of being partially aggregated on
> the map side.
> We have written a UDAF for this, but that raises other issues, such as
> blocking predicate push-down for columns.
> Do you have any idea for a proper solution?
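The shuffle concern in the issue comes down to whether the operation can be combined from partial results. A max-style aggregate can be, because max is associative: each partition can be reduced to one candidate per key before anything is exchanged. The following is a plain-Python illustration of that property only, not Spark's actual execution; partial_max, merge, and the two-partition split are hypothetical:

```python
from functools import reduce


def partial_max(partition):
    """Map side: reduce one partition to a single (date, value) pair per key."""
    best = {}
    for user_id, action_type, value, date in partition:
        key = (user_id, action_type)
        candidate = (date, value)  # date first, so it decides the comparison
        if key not in best or candidate > best[key]:
            best[key] = candidate
    return best


def merge(left, right):
    """Reduce side: combine two partial results by keeping the larger pair per key."""
    merged = dict(left)
    for key, candidate in right.items():
        if key not in merged or candidate > merged[key]:
            merged[key] = candidate
    return merged


# Hypothetical split of the sample rows across two partitions.
partitions = [
    [(123, "fav category", 1, "2016-02-01"), (123, "fav category", 8, "2016-02-03")],
    [(123, "fav category", 4, "2016-02-02"), (123, "fav category", 2, "2016-02-04")],
]

partials = [partial_max(p) for p in partitions]  # one entry per key per partition
final = reduce(merge, partials)
```

Each partition contributes a single entry per key to the merge step, rather than every row, and the merged result is the same as taking the max over all rows at once.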