What's the issue with groupByKey()?

On Mon, Oct 19, 2015 at 1:11 AM, fahad shah <sfaha...@gmail.com> wrote:
> Hi
>
> I wanted to ask what's the best way to achieve per-key auto-increment
> numerals after sorting, e.g.:
>
> raw file:
>
> 1,a,b,c,1,1
> 1,a,b,d,0,0
> 1,a,b,e,1,0
> 2,a,e,c,0,0
> 2,a,f,d,1,0
>
> desired output (the last column is the position number after grouping on
> the first three fields and reverse-sorting on the last two values):
>
> 1,a,b,c,1,1,1
> 1,a,b,d,0,0,3
> 1,a,b,e,1,0,2
> 2,a,e,c,0,0,2
> 2,a,f,d,1,0,1
>
> I am using a solution based on groupByKey, but it is running into some
> issues (possibly a bug in PySpark/Spark?), so I am wondering whether
> there is a better way to achieve this.
>
> My solution:
>
> A = (sc.textFile("train.csv")
>      .filter(lambda x: not isHeader(x))
>      .map(split)
>      .map(parse_train)
>      .filter(lambda x: x is not None))
>
> B = A.map(lambda k:
>     ((k.first_field, k.second_field, k.third_field), k[0:5])).groupByKey()
>
> B.map(sort_n_set_position).flatMap(lambda line: line)
>
> where sort_n_set_position iterates over each group's iterator, sorts the
> records, and appends the position as the last column.
>
> Best,
> fahad
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
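
The per-group step described in the quoted message (sort each group in
reverse on the last two values, then append the position) could look roughly
like the sketch below. It is plain Python so it can be tried without a
cluster. The names sort_and_set_position, rank_locally and key_len are
hypothetical and not taken from the original code. The sketch also assumes
key_len=2, because the sample output ranks the last two rows together even
though their third fields differ; ties keep their input order.

```python
from collections import defaultdict


def sort_and_set_position(group):
    """Append a 1-based position to every record of one (key, records) group."""
    _key, records = group
    records = list(records)  # groupByKey yields an iterable, not a list
    # Record indices ordered by the last two fields, descending.
    order = sorted(
        range(len(records)),
        key=lambda i: (records[i][-2], records[i][-1]),
        reverse=True,
    )
    position = {idx: rank for rank, idx in enumerate(order, start=1)}
    # Keep the original record order and add the position as the last column.
    return [tuple(rec) + (position[i],) for i, rec in enumerate(records)]


def rank_locally(rows, key_len):
    """Local stand-in for map -> groupByKey -> flatMap (assumption: no Spark)."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[:key_len])].append(row)
    out = []
    for group in groups.items():
        out.extend(sort_and_set_position(group))
    return out


rows = [
    (1, "a", "b", "c", 1, 1),
    (1, "a", "b", "d", 0, 0),
    (1, "a", "b", "e", 1, 0),
    (2, "a", "e", "c", 0, 0),
    (2, "a", "f", "d", 1, 0),
]
ranked = rank_locally(rows, key_len=2)
for rec in ranked:
    print(",".join(str(field) for field in rec))
```

On an RDD of parsed records, the same function could presumably be applied
with rdd.map(lambda r: (tuple(r[:2]), r)).groupByKey().flatMap(sort_and_set_position),
which replaces the separate map and flatMap steps with a single flatMap.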

