Hi Abhinav, this sounds to me like a bad design, since it isn't scalable. Would it be possible to store all the data in a database like HBase/Bigtable/Cassandra? That would let all the workers write the data to the database in parallel.
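To illustrate the idea: each executor writes only its own partition to the sink, so no single node ever holds one huge group in memory. A minimal pure-Python sketch, with a thread pool standing in for the executors and a dict standing in for the database table (`write_partition` and `store` are hypothetical names, not Spark or Cassandra API):

```python
# Sketch: partition-parallel writes. Each worker flushes its own rows
# to the sink; nothing is collected onto a single node first.
from concurrent.futures import ThreadPoolExecutor

store = {}  # stands in for the database table

def write_partition(part_id, rows):
    # In a real job this would be a batched INSERT into Cassandra/HBase.
    store[part_id] = list(rows)
    return len(rows)

# Two "partitions" of the sample data from the question below.
partitions = {0: [(1, "as", "sd"), (1, "df", "fg")],
              1: [(1, "gh", "jk"), (2, "rt", "ty")]}

with ThreadPoolExecutor() as pool:
    written = sum(pool.map(lambda kv: write_partition(*kv),
                           partitions.items()))

print(written)  # 4 rows written, spread across two parallel writers
```

With a real connector the same shape is usually just `df.write` with the sink's format and options, letting Spark do the per-partition parallelism for you.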
Cheers,
Fokko

On Wed, 27 Nov 2019 at 06:58, Ranjan, Abhinav <abhinav.ranjan...@gmail.com> wrote:

> Hi all,
>
> I want to collect some rows into a list using Spark's collect_list
> function.
>
> However, the number of rows going into the list is overflowing the
> memory. Is there any way to force the collection of rows onto disk
> rather than into memory, or else, instead of collecting it as one
> list, collect it as a list of lists, so as to avoid collecting the
> whole thing into memory?
>
> ex: df as:
>
> id col1 col2
> 1  as   sd
> 1  df   fg
> 1  gh   jk
> 2  rt   ty
>
> df.groupBy(id).agg(collect_list(struct(col1, col2)) as col3)
>
> id col3
> 1  [(as,sd),(df,fg),(gh,jk)]
> 2  [(rt,ty)]
>
> So if id=1 has too many rows, the list will overflow. How can I avoid
> this scenario?
>
> Thanks,
> Abhinav
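For reference, the aggregation in the quoted question is equivalent to this plain-Python sketch on the sample rows, which also makes the memory problem visible: the whole list for a key materializes on one node:

```python
# Pure-Python equivalent of
#   df.groupBy("id").agg(collect_list(struct("col1", "col2")))
# on the sample data. Every (col1, col2) pair for a key ends up in ONE
# in-memory list -- which is exactly what overflows for a hot key.
from itertools import groupby

rows = [(1, "as", "sd"), (1, "df", "fg"), (1, "gh", "jk"), (2, "rt", "ty")]

grouped = {key: [(c1, c2) for _, c1, c2 in group]
           for key, group in groupby(sorted(rows), key=lambda r: r[0])}

print(grouped)
# {1: [('as', 'sd'), ('df', 'fg'), ('gh', 'jk')], 2: [('rt', 'ty')]}
```

Because the collected list for id=1 is a single value in a single row, it cannot be spilled or split by Spark; that is why pushing the rows to an external store, rather than collecting them, scales better.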