Re: pyspark pickle error when using itertools.groupby

2016-08-05 Thread Eike von Seggern
Hello,

`itertools.groupby` is evaluated lazily, and the `g`s in your code are grouper objects, not lists; those cannot be pickled, which might cause your problem. Casting everything to lists might help here, e.g.:

    grp2 = [(k, list(g)) for k, g in groupby(grp1, lambda e: e[1])]

HTH
Eike

2016-08-05 7:31 GMT+02:00 林家銘:

pyspark pickle error when using itertools.groupby

2016-08-04 Thread 林家銘
Hi,

I wrote a map function to aggregate data within a partition. The function uses `itertools.groupby` more than twice, and then a pickle error occurs. Here is what I do:

===Driver Code===

    pair_count = df.mapPartitions(lambda iterable: pair_func_cnt(iterable))
    pair_count.collect()
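[Editor's note] The poster's `pair_func_cnt` is not shown in the archived message. For context, here is a hypothetical sketch of a partition aggregator of that shape, testable locally without Spark; the key position (`e[1]`) and the counting logic are assumptions, not the original code:

```python
import itertools

def pair_func_cnt(rows):
    # Hypothetical partition aggregator: count rows per key within one
    # partition. groupby requires its input sorted by the grouping key,
    # and each group must be materialised (here via list()) before the
    # outer iterator advances, so only picklable values are yielded.
    rows = sorted(rows, key=lambda e: e[1])
    for key, group in itertools.groupby(rows, key=lambda e: e[1]):
        yield (key, len(list(group)))

# Local check on a single "partition":
print(list(pair_func_cnt([("x", 1), ("y", 1), ("z", 2)])))
```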