Re: Maintaining order of pair rdd

2016-07-26 Thread Kuchekar
Hi Janardhan, you could do something like this: to maintain the insertion order per key, first partition by key (so that all records for a key are located in the same partition), and after that you can do something like this. RDD.mapValues( x => ArrayBuffer(x)).reduceByKey(x,y =>

Re: Maintaining order of pair rdd

2016-07-26 Thread janardhan shetty
Let me provide stepwise details: 1. I have an RDD = { (ID2,18159) - *element 1* (ID1,18159) - *element 2* (ID3,18159) - *element 3* (ID2,36318) - *element 4* (ID1,36318) - *element 5* (ID3,36318) (ID2,54477) (ID1,54477) (ID3,54477) } 2. RDD.groupByKey().mapValues(v => v.toArray()) Array(
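The ordering requirement described in this thread (keep keys in first-seen order, and keep each key's values in insertion order) can be illustrated with a minimal sketch on plain Scala collections. This is not code from the thread: the object name OrderPreservingGroup and the helper groupInOrder are hypothetical, and the sketch assumes the data fits in memory. The idea is to attach an explicit index before grouping and sort by it afterwards, so the result does not depend on how the grouping step orders things. On an actual RDD the analogous first step would presumably be RDD.zipWithIndex(), followed by a sort on the index inside mapValues after grouping.

```scala
object OrderPreservingGroup {
  // Hypothetical helper: groups values by key while keeping
  // (a) keys in the order they first appear and
  // (b) each key's values in their original insertion order.
  def groupInOrder[K, V](pairs: Seq[(K, V)]): Seq[(K, Seq[V])] =
    pairs.zipWithIndex                          // remember each element's position
      .groupBy { case ((k, _), _) => k }        // grouping gives no key-order guarantee
      .toSeq
      .map { case (k, indexed) =>
        // restore insertion order inside the group using the saved index
        val sorted = indexed.sortBy { case (_, idx) => idx }
        (k, sorted.head._2, sorted.map { case ((_, v), _) => v })
      }
      .sortBy { case (_, firstIdx, _) => firstIdx } // restore first-seen key order
      .map { case (k, _, vs) => (k, vs) }

  def main(args: Array[String]): Unit = {
    // Sample pairs taken from the message above.
    val pairs = Seq(
      ("ID2", 18159), ("ID1", 18159), ("ID3", 18159),
      ("ID2", 36318), ("ID1", 36318), ("ID3", 36318),
      ("ID2", 54477), ("ID1", 54477), ("ID3", 54477)
    )
    groupInOrder(pairs).foreach(println)
  }
}
```

Carrying the index explicitly costs one extra sort per group, but it makes the ordering independent of any shuffle or hash-based grouping in between.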

Re: Maintaining order of pair rdd

2016-07-26 Thread Marco Mistroni
Apologies Janardhan, I always get confused on this. OK, so you have a (key, val) RDD (val is irrelevant here); then you can do this: val reduced = myRDD.reduceByKey((first, second) => first ++ second) val sorted = reduced.sortBy(tpl => tpl._1) hth On Tue, Jul 26, 2016 at 3:31 AM, janardhan

Re: Maintaining order of pair rdd

2016-07-25 Thread janardhan shetty
groupBy is a shuffle operation, so the index is already lost in this process, if I am not wrong, and I don't see a *sortWith* operation on RDD. Any suggestions or help? On Mon, Jul 25, 2016 at 12:58 AM, Marco Mistroni wrote: > Hi > after you do a groupBy you should use a sortWith.

Re: Maintaining order of pair rdd

2016-07-25 Thread Marco Mistroni
Hi, after you do a groupBy you should use a sortWith. Basically, a groupBy reduces your structure to (anyone correct me if I'm wrong) an RDD[(key,val)], which you can see as a tuple, so you could use sortWith (or sortBy, cannot remember which one) (tpl=> tpl._1) hth On Mon, Jul 25, 2016 at

Re: Maintaining order of pair rdd

2016-07-24 Thread janardhan shetty
Thanks Marco. This solved the order problem. I had another question which comes before this one. As you can see below, ID2, ID1 and ID3 are in order and I need to maintain this index order as well. But when we do the groupByKey operation (*rdd.distinct.groupByKey().mapValues(v => v.toArray*)) everything is

Re: Maintaining order of pair rdd

2016-07-24 Thread Marco Mistroni
Hello, uhm, you have an array containing 3 tuples? If all the arrays have the same length, you can just zip all of them, creating a list of tuples; then you can scan the list 5 by 5...? So something like (Array(0)._2,Array(1)._2,Array(2)._2).zipped.toList this will give you a list of tuples of 3

Re: Maintaining order of pair rdd

2016-07-24 Thread janardhan shetty
Array( (ID1,Array(18159, 308703, 72636, 64544, 39244, 107937, 54477, 145272, 100079, 36318, 160992, 817, 89366, 150022, 19622, 44683, 58866, 162076, 45431, 100136)), (ID3,Array(100079, 19622, 18159, 212064, 107937, 44683, 150022, 39244, 100136, 58866, 72636, 145272, 817, 89366, 54477, 36318,

Re: Maintaining order of pair rdd

2016-07-24 Thread Marco Mistroni
Apologies, I misinterpreted. Could you post two use cases? Kr On 24 Jul 2016 3:41 pm, "janardhan shetty" wrote: > Marco, > > Thanks for the response. It is indexed order and not ascending or > descending order. > On Jul 24, 2016 7:37 AM, "Marco Mistroni"

Re: Maintaining order of pair rdd

2016-07-24 Thread janardhan shetty
Marco, Thanks for the response. It is indexed order and not ascending or descending order. On Jul 24, 2016 7:37 AM, "Marco Mistroni" wrote: > Use map values to transform to an rdd where values are sorted? > Hth > > On 24 Jul 2016 6:23 am, "janardhan shetty"

Maintaining order of pair rdd

2016-07-23 Thread janardhan shetty
I have a (key, value) pair RDD where each value is an array of Ints. I need to maintain the order of the values in order to execute downstream modifications. How do we maintain the order of values? Ex: rdd = (id1,[5,2,3,15], Id2,[9,4,2,5]) Follow-up question: how do we compare between one element in rdd