Hi all,

I want to collect some rows into a list using Spark's collect_list function.

However, the number of rows going into the list is overflowing memory. Is there any way to force the collected rows to spill to disk rather than stay in memory? Or, instead of collecting them as a single list, can they be collected as a list of lists, so that the whole group never has to be held in memory at once?

Example, df:

id    col1    col2
1     as      sd
1     df      fg
1     gh      jk
2     rt      ty

df.groupBy("id").agg(collect_list(struct("col1", "col2")) as "col3")

id    col3
1     [(as,sd),(df,fg),(gh,jk)]
2     [(rt,ty)]


So if id=1 has too many rows, the list will overflow memory. How can I avoid this scenario?
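
To make the "list of lists" idea concrete, here is a minimal sketch of one possible approach, not a confirmed fix. It assumes it is acceptable to get several rows per id, each holding a smaller list, rather than one row per id. The bucket column and the numBuckets value are hypothetical names introduced only for illustration; the sample data is the df from the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, hash, lit, pmod, struct}

object ChunkedCollectSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("chunked-collect-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as the example df.
    val df = Seq(
      (1, "as", "sd"),
      (1, "df", "fg"),
      (1, "gh", "jk"),
      (2, "rt", "ty")
    ).toDF("id", "col1", "col2")

    // Hypothetical number of sub-lists per id; would need tuning for real data.
    val numBuckets = 4

    val chunked = df
      // Assign each row to a bucket from a hash of its values
      // (pmod keeps the result non-negative).
      .withColumn("bucket", pmod(hash(col("col1"), col("col2")), lit(numBuckets)))
      // Grouping by (id, bucket) means each collect_list holds only
      // part of the rows for a given id.
      .groupBy("id", "bucket")
      .agg(collect_list(struct(col("col1"), col("col2"))).as("col3"))

    chunked.show(truncate = false)
    spark.stop()
  }
}
```

This only reduces the average list size; it does not put a hard bound on it, and rows with identical (col1, col2) values always land in the same bucket. Whether downstream code can work with several partial lists per id is also an assumption.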


Thanks,

Abhnav

