Re: Re: Re: Is there an operation to create multi record for every element in a RDD?
Thank you guys, I got my code working, as below:

    val record75df = sc.parallelize(listForRule75, numPartitions)
      .map(x => x.replace("|", ","))
      .map(_.split(","))
      .map(x => Mycaseclass4(x(0).toInt, x(1).toInt, x(2).toFloat, x(3).toInt))
      .toDF()
    val userids = 1 to 1
    val uiddf = sc.parallelize(userids, numPartitions).toDF("userid")
    record75df.registerTempTable("b")
    uiddf.registerTempTable("a")
    val rule75df = sqlContext.sql("select a.*,b.* from a join b")
    rule75df.show

Thanks
Best regards!
San.Luo
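As a quick sanity check (a sketch, reusing rule75df, uiddf and record75df exactly as defined above), the unconditioned join should yield one output row per (userid, rule record) pair:

    // the cross product should have count(userids) * count(rule records) rows
    assert(rule75df.count == uiddf.count * record75df.count)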
Re: Re: Is there an operation to create multi record for every element in a RDD?
rdd has a cartesian method
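A minimal sketch of that cartesian approach on Spark 1.6, assuming rdd1 and listForRule77 as defined in the original question below:

    val listRDD = sc.parallelize(listForRule77)
    // cartesian pairs every userid with every rule string;
    // then prefix each rule string with its userid
    val rdd2 = rdd1.cartesian(listRDD).map { case (uid, s) => s"$uid,$s" }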
Re: Re: Is there an operation to create multi record for every element in a RDD?
If you use join without any condition, it becomes a cross join. In SQL, it looks like:

    select a.*, b.* from a join b

--
Best Regards,
Ayan Guha
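In the 1.6 DataFrame API the same thing can presumably be written as a join with no join expression (a sketch, reusing df1 and listDF from Riccardo's example below):

    // a join without a condition degenerates to a cartesian product
    val crossed = df1.join(listDF)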
Re: Re: Is there an operation to create multi record for every element in a RDD?
Riccardo and Ryan,

Thank you for your ideas. It seems that crossJoin is a new Dataset API added after Spark 2.x. My Spark version is 1.6.3. Is there an equivalent API to do a cross join? Thank you.

Thanks
Best regards!
San.Luo
Re: Is there an operation to create multi record for every element in a RDD?
Depends on your Spark version, have you considered the Dataset API?

You can do something like:

    val df1 = rdd1.toDF("userid")
    val listRDD = sc.parallelize(listForRule77)
    val listDF = listRDD.toDF("data")
    df1.crossJoin(listDF).orderBy("userid").show(60, truncate=false)

    +------+-----------------------+
    |userid|data                   |
    +------+-----------------------+
    |1     |1,1,100.00|1483891200, |
    |1     |1,1,100.00|1483804800, |
    ...
    |1     |1,1,100.00|1488902400, |
    |1     |1,1,100.00|1489075200, |
    |1     |1,1,100.00|1488470400, |
    ...
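Since the goal is records that start with the userid, the two columns could then be collapsed into a single string, e.g. (a sketch using concat_ws, with the column names from the example above):

    import org.apache.spark.sql.functions.{col, concat_ws}
    // yields strings like "1,1,1,100.00|1483286400"
    val combined = df1.crossJoin(listDF)
      .select(concat_ws(",", col("userid"), col("data")).as("record"))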
Re: Is there an operation to create multi record for every element in a RDD?
It's just sort of an inner join operation... If the second dataset isn't very large it's OK (btw, you can use flatMap directly instead of map followed by flatMap/flatten), otherwise you can register the second one as an RDD/Dataset and join them on user id.

On Wed, Aug 9, 2017 at 4:29 PM, luohui20...@sina.com wrote:

> hello guys:
>     I have a simple rdd like:
>         val userIDs = 1 to 1
>         val rdd1 = sc.parallelize(userIDs, 16) //this rdd has 1 user id
>     And I have a List[String] like below:
>         scala> listForRule77
>         res76: List[String] = List(1,1,100.00|1483286400, 1,1,100.00|1483372800,
>         1,1,100.00|1483459200, 1,1,100.00|1483545600, 1,1,100.00|1483632000,
>         1,1,100.00|1483718400, 1,1,100.00|1483804800, 1,1,100.00|1483891200,
>         1,1,100.00|1483977600, 3,1,200.00|1485878400, 1,1,100.00|1485964800,
>         1,1,100.00|1486051200, 1,1,100.00|1488384000, 1,1,100.00|1488470400,
>         1,1,100.00|1488556800, 1,1,100.00|1488643200, 1,1,100.00|1488729600,
>         1,1,100.00|1488816000, 1,1,100.00|1488902400, 1,1,100.00|1488988800,
>         1,1,100.00|1489075200, 1,1,100.00|1489161600, 1,1,100.00|1489248000,
>         1,1,100.00|1489334400, 1,1,100.00|1489420800, 1,1,100.00|1489507200,
>         1,1,100.00|1489593600, 1,1,100.00|148968, 1,1,100.00|1489766400)
>
>         scala> listForRule77.length
>         res77: Int = 29
>
>     I need to create an rdd containing 29 records: for every userid in rdd1, I need to create 29 records according to listForRule77, each record starting with the userid, for example 1(the userid),1,1,100.00|1483286400.
>     My idea is like below:
>     1. write a udf to add the userid to the beginning of every string element of listForRule77.
>     2. use val rdd2 = rdd1.map(x => List_udf(x)).flatMap(identity); the result rdd2 may be what I need.
>
>     My question: are there any problems in my idea? Is there a better way to do this?
>
> Thanks
> Best regards!
> San.Luo
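A sketch of the direct flatMap variant Ryan mentions, assuming listForRule77 is small enough to ship inside the closure (for a larger list, sc.broadcast would avoid reshipping it with every task):

    // emit one record per (userid, rule string) pair, userid first
    val rdd2 = rdd1.flatMap(uid => listForRule77.map(s => s"$uid,$s"))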