What's the issue with groupByKey()?

On Mon, Oct 19, 2015 at 1:11 AM, fahad shah <sfaha...@gmail.com> wrote:
> Hi
>
> I wanted to ask what's the best way to achieve per-key auto-increment
> numerals after sorting, e.g.:
>
> raw file:
>
> 1,a,b,c,1,1
> 1,a,b,d,0,0
> 1,a,b,e,1,0
> 2,a,e,c,0,0
> 2,a,f,d,1,0
>
> post-output (the last column is the position number after grouping on
> the first three fields and reverse sorting on the last two values):
>
> 1,a,b,c,1,1,1
> 1,a,b,d,0,0,3
> 1,a,b,e,1,0,2
> 2,a,e,c,0,0,2
> 2,a,f,d,1,0,1
>
> I am using a solution based on groupByKey, but it is running into some
> issues (possibly a bug in PySpark/Spark?), and I am wondering if there
> is a better way to achieve this.
>
> My solution:
>
> A = (sc.textFile("train.csv")
>      .filter(lambda x: not isHeader(x))
>      .map(split)
>      .map(parse_train)
>      .filter(lambda x: x is not None))
>
> B = A.map(lambda k: ((k.first_field, k.second_field, k.third_field),
>                      k[0:5])).groupByKey()
>
> B.map(sort_n_set_position).flatMap(lambda line: line)
>
> where sort_n_set_position iterates over the iterator, sorts the rows,
> and adds the last column.
>
> best
> fahad
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
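For reference, here is a minimal sketch of what the per-group step described above might look like. The body of `sort_n_set_position` was not posted, so this version is an assumption, as are the `parse` helper and the row layout (four string fields followed by two integers). A plain dict stands in for `groupByKey()` so the snippet runs without a Spark cluster. It groups on the first three fields as the description states, which puts `2,a,e,c,0,0` alone in its group.

```python
def sort_n_set_position(group):
    """Rank the rows of one (key, rows) pair.

    Hypothetical stand-in for the helper named in the quoted message.
    """
    _key, rows = group
    # Reverse sort on the last two values, so the largest pair comes first.
    ordered = sorted(rows, key=lambda row: (row[-2], row[-1]), reverse=True)
    # Append a 1-based position to every row.
    return [tuple(row) + (pos,) for pos, row in enumerate(ordered, start=1)]


def parse(line):
    """Split a CSV line; the last two fields are assumed to be integers."""
    parts = line.split(",")
    return tuple(parts[:4]) + (int(parts[4]), int(parts[5]))


lines = [
    "1,a,b,c,1,1",
    "1,a,b,d,0,0",
    "1,a,b,e,1,0",
    "2,a,e,c,0,0",
    "2,a,f,d,1,0",
]

# Local stand-in for groupByKey(): collect the rows of each key in a dict.
groups = {}
for row in map(parse, lines):
    groups.setdefault(row[:3], []).append(row)

# Flatten the ranked groups back into a single list of rows.
ranked = [r for group in groups.items() for r in sort_n_set_position(group)]

for r in ranked:
    print(",".join(str(field) for field in r))
```

On an RDD produced by `groupByKey()`, the same function could be applied with `B.flatMap(sort_n_set_position)`, which replaces the `map` followed by `flatMap(lambda line: line)`. Note that every row of a key is held in memory on one executor while it is sorted, so very large groups can be a problem with this approach.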