Re: best way to generate per key auto increment numerals after sorting

2015-10-19 Thread fahad shah
Thanks Davies,

groupByKey was throwing the error: "unpack requires a string
argument of length 4"

Interestingly, I replaced it with sortByKey (which I read also
shuffles so that rows with the same key land on the same partition)
and it ran fine. Wondering if this is a bug in groupByKey in Spark
1.3? A rough sketch of the sort-based pattern is below.
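
For the archives, here is a rough sketch of the sort-based pattern
(not the exact code I ran; the parse helpers and field names are the
ones from the snippet quoted below, and numPartitions is an arbitrary
choice). Partitioning on the group fields alone keeps each group in a
single partition, which a plain sortByKey on the full composite key
does not strictly guarantee:

from pyspark.rdd import portable_hash

# key = (three group fields, two sort fields); value = the full row
pairs = A.map(lambda k: ((k.first_field, k.second_field, k.third_field,
                          k[4], k[5]), tuple(k[0:6])))

# Partition on the group prefix only, then sort within each partition
# descending so every group's rows arrive contiguous and best-first.
ordered = pairs.repartitionAndSortWithinPartitions(
    numPartitions=8,
    partitionFunc=lambda key: portable_hash(key[0:3]),
    ascending=False)

def number_within_group(items):
    # Rows of a group are adjacent, so a running counter that resets
    # whenever the group prefix changes assigns the position.
    last_group, pos = None, 0
    for key, row in items:
        group = key[0:3]
        pos = pos + 1 if group == last_group else 1
        last_group = group
        yield row + (pos,)

result = ordered.mapPartitions(number_within_group)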

best fahad

On Mon, Oct 19, 2015 at 10:45 AM, Davies Liu  wrote:
> What's the issue with groupByKey()?
>
> On Mon, Oct 19, 2015 at 1:11 AM, fahad shah  wrote:
>> Hi
>>
>> I wanted to ask what's the best way to achieve per-key auto-increment
>> numbering after sorting. For example:
>>
>> raw file:
>>
>> 1,a,b,c,1,1
>> 1,a,b,d,0,0
>> 1,a,b,e,1,0
>> 2,a,e,c,0,0
>> 2,a,f,d,1,0
>>
>> expected output (the last column is the 1-based position after grouping
>> on the first three fields and reverse-sorting on the last two values):
>>
>> 1,a,b,c,1,1,1
>> 1,a,b,d,0,0,3
>> 1,a,b,e,1,0,2
>> 2,a,e,c,0,0,2
>> 2,a,f,d,1,0,1
>>
>> I am using a solution based on groupByKey, but it is running into some
>> issues (possibly a bug in PySpark/Spark?). Wondering if there is a
>> better way to achieve this.
>>
>> My solution:
>>
>> A = (sc.textFile("train.csv")
>>      .filter(lambda x: not isHeader(x))
>>      .map(split)
>>      .map(parse_train)
>>      .filter(lambda x: x is not None))
>>
>> B = A.map(lambda k: ((k.first_field, k.second_field, k.third_field),
>>                      k[0:6])).groupByKey()
>>
>> C = B.map(sort_n_set_position).flatMap(lambda line: line)
>>
>> where sort_n_set_position iterates over each group's values, sorts
>> them, and appends the position column.
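>>
>> For concreteness, a minimal sketch of what sort_n_set_position does
>> (the exact row layout is an assumption based on the example above):
>>
>> def sort_n_set_position(kv):
>>     # kv is a (key, iterable-of-rows) pair as produced by groupByKey.
>>     key, rows = kv
>>     # Reverse-sort on the last two columns, then append the 1-based
>>     # position; rows are assumed to be tuples here.
>>     ordered = sorted(rows, key=lambda r: (r[-2], r[-1]), reverse=True)
>>     return [tuple(r) + (i + 1,) for i, r in enumerate(ordered)]
>>
>> With the sample above, the (1,a,b) group then comes back with
>> positions 1, 2, 3 for the c, e, d rows respectively.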
>>
>> best fahad
>>
