Thanks Davies,
groupByKey() was throwing the error: "unpack requires a string
argument of length 4".
Interestingly, I replaced it with sortByKey() (which, from what I read,
also shuffles so that data for the same key ends up on the same
partition) and it ran fine. Could this be a bug in groupByKey() in
Spark 1.3?
best fahad
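
A rough illustration of the sort-based route mentioned above: once rows are ordered by the group key and then by the last two values in descending order, the position can be assigned by walking the sorted rows and restarting the counter whenever the key changes. The sketch below is plain Python with no Spark calls. The function name number_sorted_rows and the key_len parameter are hypothetical, and it assumes all rows for a key sit in one sorted sequence (as they would within a single partition) and that the last two fields are numeric.

```python
from itertools import groupby

def number_sorted_rows(rows, key_len=3):
    """Sort rows, then number them 1, 2, 3, ... within each run of equal keys.

    key_len is a hypothetical parameter: how many leading fields form the key.
    """
    # Order by key ascending, then by the last two values descending.
    ordered = sorted(rows, key=lambda r: (r[:key_len], -r[-2], -r[-1]))
    out = []
    # groupby() yields consecutive runs that share a key, which is what a
    # key-sorted sequence provides.
    for _, run in groupby(ordered, key=lambda r: r[:key_len]):
        for position, row in enumerate(run, start=1):
            out.append(row + (position,))
    return out

# First three sample rows from the original question.
rows = [
    ("1", "a", "b", "c", 1, 1),
    ("1", "a", "b", "d", 0, 0),
    ("1", "a", "b", "e", 1, 0),
]
numbered = number_sorted_rows(rows)
```

Unlike the grouped approach, the result comes back in sorted order rather than in the input order.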
On Mon, Oct 19, 2015 at 10:45 AM, Davies Liu wrote:
> What's the issue with groupByKey()?
>
> On Mon, Oct 19, 2015 at 1:11 AM, fahad shah wrote:
>> Hi
>>
>> I wanted to ask what's the best way to achieve per-key auto-increment
>> numbering after sorting, e.g.:
>>
>> raw file:
>>
>> 1,a,b,c,1,1
>> 1,a,b,d,0,0
>> 1,a,b,e,1,0
>> 2,a,e,c,0,0
>> 2,a,f,d,1,0
>>
>> desired output (the last column is the position number after grouping on
>> the first three fields and reverse sorting on the last two values):
>>
>> 1,a,b,c,1,1,1
>> 1,a,b,d,0,0,3
>> 1,a,b,e,1,0,2
>> 2,a,e,c,0,0,2
>> 2,a,f,d,1,0,1
>>
>> I am using a solution based on groupByKey(), but it is running into some
>> issues (possibly a bug in PySpark/Spark?), so I am wondering if there is
>> a better way to achieve this.
>>
>> My solution:
>>
>> A = (sc.textFile("train.csv")
>>        .filter(lambda x: not isHeader(x))
>>        .map(split).map(parse_train)
>>        .filter(lambda x: x is not None))
>>
>> B = A.map(lambda k:
>>     ((k.first_field, k.second_field, k.third_field),
>>      k[0:5])).groupByKey()
>>
>> B.map(sort_n_set_position).flatMap(lambda line: line)
>>
>> where sort_n_set_position iterates over each group's iterator, sorts
>> the rows, and appends the position as the last column.
>>
>> best fahad
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
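
The helper sort_n_set_position is described in the quoted message but not shown. The following is a minimal sketch of what such a function might look like, assuming each element passed to it is a (key, rows) pair as produced by groupByKey() and that the last two fields of each row are numeric. The body is a guess for illustration, not the original code. It keeps rows in their input order and appends each row's rank within the group.

```python
def sort_n_set_position(pair):
    """Rank the rows of one group by their last two values, largest first.

    Hypothetical stand-in for the helper named in the post.
    """
    key, rows = pair
    rows = list(rows)  # the grouped values arrive as an iterable; materialize it
    # Row indices ordered by the last two values, descending.
    order = sorted(range(len(rows)),
                   key=lambda i: (rows[i][-2], rows[i][-1]),
                   reverse=True)
    position = {idx: rank for rank, idx in enumerate(order, start=1)}
    # Emit rows in input order, each with its position appended.
    return [tuple(rows[i]) + (position[i],) for i in range(len(rows))]

# One group built from the first three sample rows in the question.
group = (("1", "a", "b"), [
    ("1", "a", "b", "c", 1, 1),
    ("1", "a", "b", "d", 0, 0),
    ("1", "a", "b", "e", 1, 0),
])
ranked = sort_n_set_position(group)
```

Because the function returns a list per group, a following flatMap(lambda line: line) would flatten the groups back into individual rows.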