override collect_list

Ranjan, Abhinav Tue, 26 Nov 2019 21:58:45 -0800

Hi all,

I want to collect some rows into a list using Spark's collect_list function.

However, the number of rows going into the list is overflowing memory. Is there any way to force the collected rows onto disk rather than into memory, or, instead of collecting them as a single list, to collect them as a list of lists so that the whole collection never has to be held in memory at once?


Example, df:

id        col1    col2
1         as      sd
1         df      fg
1         gh      jk
2         rt      ty

df.groupBy("id").agg(collect_list(struct("col1", "col2")) as "col3")

id        col3
1         [(as,sd),(df,fg),(gh,jk)]
2         [(rt,ty)]

So if id=1 has too many rows, then the list will overflow memory. How can I avoid this scenario?
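
To make the "list of lists" idea concrete, here is a minimal sketch in Scala of one possible workaround: number the rows within each id, derive a chunk index from that number, and group by (id, chunk) so that no single collected list exceeds a fixed size. The names chunkSize, chunk, chunkedCollect and ChunkedCollectList are hypothetical, the ordering by col1 and col2 is an arbitrary choice made only so the row numbering is well defined, and whether this actually relieves memory pressure in a given job is an assumption that would need testing.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_list, floor, row_number, struct}

object ChunkedCollectList {

  // Hypothetical helper: one output row per (id, chunk) instead of one per id,
  // so each collected list holds at most `chunkSize` structs.
  def chunkedCollect(df: DataFrame, chunkSize: Int): DataFrame = {
    // Number the rows inside each id; the ordering columns are an assumption.
    val w = Window.partitionBy("id").orderBy("col1", "col2")

    df
      // Rows 1..chunkSize get chunk 0, the next chunkSize rows get chunk 1, etc.
      .withColumn("chunk", floor((row_number().over(w) - 1) / chunkSize))
      .groupBy("id", "chunk")
      .agg(collect_list(struct("col1", "col2")) as "col3")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("chunked-collect-list")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the example above.
    val df = Seq(
      (1, "as", "sd"),
      (1, "df", "fg"),
      (1, "gh", "jk"),
      (2, "rt", "ty")
    ).toDF("id", "col1", "col2")

    // chunkSize = 2 is only for illustration; a real job would use a larger cap.
    chunkedCollect(df, chunkSize = 2).orderBy("id", "chunk").show(false)

    spark.stop()
  }
}
```

With the sample data and chunkSize = 2, id=1 is split across two output rows (one list of two structs and one list of one), and id=2 keeps a single row. Downstream code then has to handle several rows per id rather than one, which is the trade-off for bounding the size of each list.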



Thanks,

Abhinav
