Reply: Re: Re: Is there an operation to create multi record for every element in a RDD?

2017-08-09 Thread luohui20001
Thank you guys, I got my code working like below:

val record75df = sc.parallelize(listForRule75, numPartitions)
  .map(x => x.replace("|", ","))
  .map(_.split(","))
  .map(x => Mycaseclass4(x(0).toInt, x(1).toInt, x(2).toFloat, x(3).toInt))
  .toDF()
val userids = 1 to 1
val uiddf = sc.parallelize(userids, numPartitions).toDF("userid")
record75df.registerTempTable("b")
uiddf.registerTempTable("a")
val rule75df = sqlContext.sql("select a.*,b.* from a join b")
rule75df.show
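A quick sanity check on that result (a sketch assuming the DataFrames above): a cross join yields one row per (userid, rule) pair, so the output count should be the product of the input counts.

// Sanity check: cross join row count = product of input row counts.
assert(rule75df.count == uiddf.count * record75df.count)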
Thanks
Best regards!
San.Luo

- Original Message -
From: Ryan
To: ayan guha
Cc: Riccardo Ferrari, luohui20...@sina.com, user
Subject: Re: Re: Is there an operation to create multi record for every element in a RDD?
Date: 2017-08-09 17:32

rdd has a cartesian method

On Wed, Aug 9, 2017 at 5:12 PM, ayan guha wrote:
If you use join without any condition it becomes a cross join. In SQL, it
looks like:

Select a.*,b.* from a join b

On Wed, 9 Aug 2017 at 7:08 pm, luohui20...@sina.com wrote:
Riccardo and Ryan,
   Thank you for your ideas. It seems that crossJoin is a new Dataset API
since Spark 2.x. My Spark version is 1.6.3. Is there a comparable API to do
a cross join? Thank you.
Thanks
Best regards!
San.Luo

- Original Message -
From: Riccardo Ferrari
To: Ryan
Cc: luohui20...@sina.com, user
Subject: Re: Is there an operation to create multi record for every element in a RDD?
Date: 2017-08-09 16:54

Depends on your Spark version, have you considered the Dataset API?
You can do something like:

val df1 = rdd1.toDF("userid")
val listRDD = sc.parallelize(listForRule77)
val listDF = listRDD.toDF("data")

df1.crossJoin(listDF).orderBy("userid").show(60, truncate=false)

+------+----------------------+
|userid|data                  |
+------+----------------------+
|1     |1,1,100.00|1483891200,|
|1     |1,1,100.00|1483804800,|
...
|1     |1,1,100.00|1488902400,|
|1     |1,1,100.00|1489075200,|
|1     |1,1,100.00|1488470400,|
...
On Wed, Aug 9, 2017 at 10:44 AM, Ryan wrote:
It's just a sort of inner join operation... If the second dataset isn't very
large it's OK (btw, you can use flatMap directly instead of map followed by
flatMap/flatten), otherwise you can register the second one as an
RDD/Dataset, and join them on user id.
On Wed, Aug 9, 2017 at 4:29 PM, luohui20...@sina.com wrote:
hello guys:
  I have a simple rdd like:
val userIDs = 1 to 1
val rdd1 = sc.parallelize(userIDs , 16)   //this rdd has 1 user id
  And I have a List[String] like below:
scala> listForRule77
res76: List[String] = List(1,1,100.00|1483286400, 1,1,100.00|1483372800,
1,1,100.00|1483459200, 1,1,100.00|1483545600, 1,1,100.00|1483632000,
1,1,100.00|1483718400, 1,1,100.00|1483804800, 1,1,100.00|1483891200,
1,1,100.00|1483977600, 3,1,200.00|1485878400, 1,1,100.00|1485964800,
1,1,100.00|1486051200, 1,1,100.00|1488384000, 1,1,100.00|1488470400,
1,1,100.00|1488556800, 1,1,100.00|1488643200, 1,1,100.00|1488729600,
1,1,100.00|1488816000, 1,1,100.00|1488902400, 1,1,100.00|1488988800,
1,1,100.00|1489075200, 1,1,100.00|1489161600, 1,1,100.00|1489248000,
1,1,100.00|1489334400, 1,1,100.00|1489420800, 1,1,100.00|1489507200,
1,1,100.00|1489593600, 1,1,100.00|148968, 1,1,100.00|1489766400)

scala> listForRule77.length
res77: Int = 29

  I need to create an rdd containing 29 records for every userid in rdd1,
according to listForRule77; each record starts with the userid, for example
1(the userid),1,1,100.00|1483286400.
  My idea is like below:
1. write a udf to add the userid to the beginning of every string element
of listForRule77.
2. use val rdd2 = rdd1.map{x => List_udf(x)}.flatMap(); the result rdd2 may
be what I need.
  My question: Are there any problems with my idea? Is there a better way to
do this?
Thanks
Best regards!
San.Luo





-- 
Best Regards,
Ayan Guha






Re: Re: Is there an operation to create multi record for every element in a RDD?

2017-08-09 Thread Ryan
rdd has a cartesian method
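A minimal sketch of that route (an illustration assuming the rdd1 and
listForRule77 from the question below; RDD.cartesian is available in Spark
1.6):

// Cross product at the RDD level, then collapse each (userid, rule)
// pair into one record string prefixed with the userid.
val rulesRDD = sc.parallelize(listForRule77)
val rdd2 = rdd1.cartesian(rulesRDD).map { case (userid, rule) => s"$userid,$rule" }
rdd2.take(3).foreach(println)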

On Wed, Aug 9, 2017 at 5:12 PM, ayan guha  wrote:

> If you use join without any condition it becomes a cross join. In SQL, it
> looks like:
>
> Select a.*,b.* from a join b
>
> On Wed, 9 Aug 2017 at 7:08 pm,  wrote:
>
>> Riccardo and Ryan,
>>    Thank you for your ideas. It seems that crossJoin is a new Dataset API
>> since Spark 2.x.
>> My Spark version is 1.6.3. Is there a comparable API to do a cross join?
>> Thank you.
>>
>>
>>
>> 
>>
>> Thanks
>> Best regards!
>> San.Luo
>>
>> - Original Message -
>> From: Riccardo Ferrari
>> To: Ryan
>> Cc: luohui20...@sina.com, user
>> Subject: Re: Is there an operation to create multi record for every element in
>> a RDD?
>> Date: 2017-08-09 16:54
>>
>> Depends on your Spark version, have you considered the Dataset API?
>>
>> You can do something like:
>>
>> val df1 = rdd1.toDF("userid")
>>
>> val listRDD = sc.parallelize(listForRule77)
>>
>> val listDF = listRDD.toDF("data")
>>
>> df1.crossJoin(listDF).orderBy("userid").show(60, truncate=false)
>>
>> +------+----------------------+
>> |userid|data                  |
>> +------+----------------------+
>> |1     |1,1,100.00|1483891200,|
>> |1     |1,1,100.00|1483804800,|
>> ...
>> |1     |1,1,100.00|1488902400,|
>> |1     |1,1,100.00|1489075200,|
>> |1     |1,1,100.00|1488470400,|
>> ...
>>
>> On Wed, Aug 9, 2017 at 10:44 AM, Ryan  wrote:
>>
>> It's just a sort of inner join operation... If the second dataset isn't
>> very large it's OK (btw, you can use flatMap directly instead of map
>> followed by flatMap/flatten), otherwise you can register the second one as
>> an RDD/Dataset, and join them on user id.
>>
>> On Wed, Aug 9, 2017 at 4:29 PM,  wrote:
>>
>> hello guys:
>>   I have a simple rdd like:
>> val userIDs = 1 to 1
>> val rdd1 = sc.parallelize(userIDs , 16)   //this rdd has 1 user id
>>   And I have a List[String] like below:
>> scala> listForRule77
>> res76: List[String] = List(1,1,100.00|1483286400, 1,1,100.00|1483372800,
>> 1,1,100.00|1483459200, 1,1,100.00|1483545600, 1,1,100.00|1483632000,
>> 1,1,100.00|1483718400, 1,1,100.00|1483804800, 1,1,100.00|1483891200,
>> 1,1,100.00|1483977600, 3,1,200.00|1485878400, 1,1,100.00|1485964800,
>> 1,1,100.00|1486051200, 1,1,100.00|1488384000, 1,1,100.00|1488470400,
>> 1,1,100.00|1488556800, 1,1,100.00|1488643200, 1,1,100.00|1488729600,
>> 1,1,100.00|1488816000, 1,1,100.00|1488902400, 1,1,100.00|1488988800,
>> 1,1,100.00|1489075200, 1,1,100.00|1489161600, 1,1,100.00|1489248000,
>> 1,1,100.00|1489334400, 1,1,100.00|1489420800, 1,1,100.00|1489507200,
>> 1,1,100.00|1489593600, 1,1,100.00|148968, 1,1,100.00|1489766400)
>>
>> scala> listForRule77.length
>> res77: Int = 29
>>
>>   I need to create an rdd containing 29 records for every userid in rdd1,
>> according to listForRule77; each record starts with the userid, for
>> example 1(the userid),1,1,100.00|1483286400.
>>   My idea is like below:
>> 1. write a udf to add the userid to the beginning of every string element
>> of listForRule77.
>> 2. use val rdd2 = rdd1.map{x => List_udf(x)}.flatMap(); the result rdd2
>> may be what I need.
>>
>>   My question: Are there any problems with my idea? Is there a better
>> way to do this?
>>
>>
>>
>> 
>>
>> Thanks
>> Best regards!
>> San.Luo
>>
>>
>>
>> --
> Best Regards,
> Ayan Guha
>


Re: Re: Is there an operation to create multi record for every element in a RDD?

2017-08-09 Thread ayan guha
If you use join without any condition it becomes a cross join. In SQL, it
looks like:

Select a.*,b.* from a join b
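A sketch of running that SQL on Spark 1.6 (an illustration reusing the
DataFrame names from the final solution in this thread; registerTempTable is
the pre-2.0 API):

// A join with no ON clause is planned by Spark SQL as a cross join.
uiddf.registerTempTable("a")
record75df.registerTempTable("b")
val crossed = sqlContext.sql("select a.*, b.* from a join b")
crossed.show(5)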

On Wed, 9 Aug 2017 at 7:08 pm,  wrote:

> Riccardo and Ryan,
>    Thank you for your ideas. It seems that crossJoin is a new Dataset API
> since Spark 2.x.
> My Spark version is 1.6.3. Is there a comparable API to do a cross join?
> Thank you.
>
>
>
> 
>
> Thanks
> Best regards!
> San.Luo
>
> - Original Message -
> From: Riccardo Ferrari
> To: Ryan
> Cc: luohui20...@sina.com, user
> Subject: Re: Is there an operation to create multi record for every element in a
> RDD?
> Date: 2017-08-09 16:54
>
> Depends on your Spark version, have you considered the Dataset API?
>
> You can do something like:
>
> val df1 = rdd1.toDF("userid")
>
> val listRDD = sc.parallelize(listForRule77)
>
> val listDF = listRDD.toDF("data")
>
> df1.crossJoin(listDF).orderBy("userid").show(60, truncate=false)
>
> +------+----------------------+
> |userid|data                  |
> +------+----------------------+
> |1     |1,1,100.00|1483891200,|
> |1     |1,1,100.00|1483804800,|
> ...
> |1     |1,1,100.00|1488902400,|
> |1     |1,1,100.00|1489075200,|
> |1     |1,1,100.00|1488470400,|
> ...
>
> On Wed, Aug 9, 2017 at 10:44 AM, Ryan  wrote:
>
> It's just a sort of inner join operation... If the second dataset isn't very
> large it's OK (btw, you can use flatMap directly instead of map followed by
> flatMap/flatten), otherwise you can register the second one as an
> RDD/Dataset, and join them on user id.
>
> On Wed, Aug 9, 2017 at 4:29 PM,  wrote:
>
> hello guys:
>   I have a simple rdd like:
> val userIDs = 1 to 1
> val rdd1 = sc.parallelize(userIDs , 16)   //this rdd has 1 user id
>   And I have a List[String] like below:
> scala> listForRule77
> res76: List[String] = List(1,1,100.00|1483286400, 1,1,100.00|1483372800,
> 1,1,100.00|1483459200, 1,1,100.00|1483545600, 1,1,100.00|1483632000,
> 1,1,100.00|1483718400, 1,1,100.00|1483804800, 1,1,100.00|1483891200,
> 1,1,100.00|1483977600, 3,1,200.00|1485878400, 1,1,100.00|1485964800,
> 1,1,100.00|1486051200, 1,1,100.00|1488384000, 1,1,100.00|1488470400,
> 1,1,100.00|1488556800, 1,1,100.00|1488643200, 1,1,100.00|1488729600,
> 1,1,100.00|1488816000, 1,1,100.00|1488902400, 1,1,100.00|1488988800,
> 1,1,100.00|1489075200, 1,1,100.00|1489161600, 1,1,100.00|1489248000,
> 1,1,100.00|1489334400, 1,1,100.00|1489420800, 1,1,100.00|1489507200,
> 1,1,100.00|1489593600, 1,1,100.00|148968, 1,1,100.00|1489766400)
>
> scala> listForRule77.length
> res77: Int = 29
>
>   I need to create an rdd containing 29 records for every userid in rdd1,
> according to listForRule77; each record starts with the userid, for example
> 1(the userid),1,1,100.00|1483286400.
>   My idea is like below:
> 1. write a udf to add the userid to the beginning of every string element
> of listForRule77.
> 2. use val rdd2 = rdd1.map{x => List_udf(x)}.flatMap(); the result rdd2
> may be what I need.
>
>   My question: Are there any problems with my idea? Is there a better
> way to do this?
>
>
>
> 
>
> Thanks
> Best regards!
> San.Luo
>
>
>
> --
Best Regards,
Ayan Guha


Reply: Re: Is there an operation to create multi record for every element in a RDD?

2017-08-09 Thread luohui20001
Riccardo and Ryan,
   Thank you for your ideas. It seems that crossJoin is a new Dataset API
since Spark 2.x. My Spark version is 1.6.3. Is there a comparable API to do
a cross join? Thank you.

Thanks
Best regards!
San.Luo

- Original Message -
From: Riccardo Ferrari
To: Ryan
Cc: luohui20...@sina.com, user
Subject: Re: Is there an operation to create multi record for every element in a RDD?
Date: 2017-08-09 16:54

Depends on your Spark version, have you considered the Dataset API?
You can do something like:

val df1 = rdd1.toDF("userid")
val listRDD = sc.parallelize(listForRule77)
val listDF = listRDD.toDF("data")

df1.crossJoin(listDF).orderBy("userid").show(60, truncate=false)

+------+----------------------+
|userid|data                  |
+------+----------------------+
|1     |1,1,100.00|1483891200,|
|1     |1,1,100.00|1483804800,|
...
|1     |1,1,100.00|1488902400,|
|1     |1,1,100.00|1489075200,|
|1     |1,1,100.00|1488470400,|
...
On Wed, Aug 9, 2017 at 10:44 AM, Ryan  wrote:
It's just a sort of inner join operation... If the second dataset isn't very
large it's OK (btw, you can use flatMap directly instead of map followed by
flatMap/flatten), otherwise you can register the second one as an
RDD/Dataset, and join them on user id.
On Wed, Aug 9, 2017 at 4:29 PM, luohui20...@sina.com wrote:
hello guys:
  I have a simple rdd like:
val userIDs = 1 to 1
val rdd1 = sc.parallelize(userIDs , 16)   //this rdd has 1 user id
  And I have a List[String] like below:
scala> listForRule77
res76: List[String] = List(1,1,100.00|1483286400, 1,1,100.00|1483372800,
1,1,100.00|1483459200, 1,1,100.00|1483545600, 1,1,100.00|1483632000,
1,1,100.00|1483718400, 1,1,100.00|1483804800, 1,1,100.00|1483891200,
1,1,100.00|1483977600, 3,1,200.00|1485878400, 1,1,100.00|1485964800,
1,1,100.00|1486051200, 1,1,100.00|1488384000, 1,1,100.00|1488470400,
1,1,100.00|1488556800, 1,1,100.00|1488643200, 1,1,100.00|1488729600,
1,1,100.00|1488816000, 1,1,100.00|1488902400, 1,1,100.00|1488988800,
1,1,100.00|1489075200, 1,1,100.00|1489161600, 1,1,100.00|1489248000,
1,1,100.00|1489334400, 1,1,100.00|1489420800, 1,1,100.00|1489507200,
1,1,100.00|1489593600, 1,1,100.00|148968, 1,1,100.00|1489766400)

scala> listForRule77.length
res77: Int = 29

  I need to create an rdd containing 29 records for every userid in rdd1,
according to listForRule77; each record starts with the userid, for example
1(the userid),1,1,100.00|1483286400.
  My idea is like below:
1. write a udf to add the userid to the beginning of every string element
of listForRule77.
2. use val rdd2 = rdd1.map{x => List_udf(x)}.flatMap(); the result rdd2 may
be what I need.
  My question: Are there any problems with my idea? Is there a better way to
do this?

Thanks
Best regards!
San.Luo







Re: Is there an operation to create multi record for every element in a RDD?

2017-08-09 Thread Riccardo Ferrari
Depends on your Spark version, have you considered the Dataset API?

You can do something like:

val df1 = rdd1.toDF("userid")

val listRDD = sc.parallelize(listForRule77)

val listDF = listRDD.toDF("data")

df1.crossJoin(listDF).orderBy("userid").show(60, truncate=false)

+------+----------------------+
|userid|data                  |
+------+----------------------+
|1     |1,1,100.00|1483891200,|
|1     |1,1,100.00|1483804800,|
...
|1     |1,1,100.00|1488902400,|
|1     |1,1,100.00|1489075200,|
|1     |1,1,100.00|1488470400,|
...
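For the Spark 1.6 case raised elsewhere in the thread, where
Dataset.crossJoin is not available, a condition-less DataFrame join is one
possible equivalent (a sketch assuming the df1 and listDF defined above):

// In Spark 1.x, join without a condition yields the Cartesian product.
val crossed = df1.join(listDF)
crossed.orderBy("userid").show(60, truncate = false)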

On Wed, Aug 9, 2017 at 10:44 AM, Ryan  wrote:

> It's just a sort of inner join operation... If the second dataset isn't very
> large it's OK (btw, you can use flatMap directly instead of map followed by
> flatMap/flatten), otherwise you can register the second one as an
> RDD/Dataset, and join them on user id.
>
> On Wed, Aug 9, 2017 at 4:29 PM,  wrote:
>
>> hello guys:
>>   I have a simple rdd like:
>> val userIDs = 1 to 1
>> val rdd1 = sc.parallelize(userIDs , 16)   //this rdd has 1 user id
>>   And I have a List[String] like below:
>> scala> listForRule77
>> res76: List[String] = List(1,1,100.00|1483286400, 1,1,100.00|1483372800,
>> 1,1,100.00|1483459200, 1,1,100.00|1483545600, 1,1,100.00|1483632000,
>> 1,1,100.00|1483718400, 1,1,100.00|1483804800, 1,1,100.00|1483891200,
>> 1,1,100.00|1483977600, 3,1,200.00|1485878400, 1,1,100.00|1485964800,
>> 1,1,100.00|1486051200, 1,1,100.00|1488384000, 1,1,100.00|1488470400,
>> 1,1,100.00|1488556800, 1,1,100.00|1488643200, 1,1,100.00|1488729600,
>> 1,1,100.00|1488816000, 1,1,100.00|1488902400, 1,1,100.00|1488988800,
>> 1,1,100.00|1489075200, 1,1,100.00|1489161600, 1,1,100.00|1489248000,
>> 1,1,100.00|1489334400, 1,1,100.00|1489420800, 1,1,100.00|1489507200,
>> 1,1,100.00|1489593600, 1,1,100.00|148968, 1,1,100.00|1489766400)
>>
>> scala> listForRule77.length
>> res77: Int = 29
>>
>>   I need to create an rdd containing 29 records for every userid in rdd1,
>> according to listForRule77; each record starts with the userid, for
>> example 1(the userid),1,1,100.00|1483286400.
>>   My idea is like below:
>> 1. write a udf to add the userid to the beginning of every string element
>> of listForRule77.
>> 2. use val rdd2 = rdd1.map{x => List_udf(x)}.flatMap(); the result rdd2
>> may be what I need.
>>
>>   My question: Are there any problems with my idea? Is there a better
>> way to do this?
>>
>>
>>
>> 
>>
>> Thanks
>> Best regards!
>> San.Luo
>>
>
>


Re: Is there an operation to create multi record for every element in a RDD?

2017-08-09 Thread Ryan
It's just a sort of inner join operation... If the second dataset isn't very
large it's OK (btw, you can use flatMap directly instead of map followed by
flatMap/flatten), otherwise you can register the second one as an
RDD/Dataset, and join them on user id.
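A minimal sketch of that flatMap route (an illustration assuming the rdd1
and listForRule77 from the question below; broadcasting the small list
avoids re-serializing it with every task):

// Emit one record per (userid, rule) pair, prefixed with the userid.
val rulesBc = sc.broadcast(listForRule77)
val rdd2 = rdd1.flatMap(userid => rulesBc.value.map(rule => s"$userid,$rule"))
rdd2.take(3).foreach(println)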

On Wed, Aug 9, 2017 at 4:29 PM,  wrote:

> hello guys:
>   I have a simple rdd like:
> val userIDs = 1 to 1
> val rdd1 = sc.parallelize(userIDs , 16)   //this rdd has 1 user id
>   And I have a List[String] like below:
> scala> listForRule77
> res76: List[String] = List(1,1,100.00|1483286400, 1,1,100.00|1483372800,
> 1,1,100.00|1483459200, 1,1,100.00|1483545600, 1,1,100.00|1483632000,
> 1,1,100.00|1483718400, 1,1,100.00|1483804800, 1,1,100.00|1483891200,
> 1,1,100.00|1483977600, 3,1,200.00|1485878400, 1,1,100.00|1485964800,
> 1,1,100.00|1486051200, 1,1,100.00|1488384000, 1,1,100.00|1488470400,
> 1,1,100.00|1488556800, 1,1,100.00|1488643200, 1,1,100.00|1488729600,
> 1,1,100.00|1488816000, 1,1,100.00|1488902400, 1,1,100.00|1488988800,
> 1,1,100.00|1489075200, 1,1,100.00|1489161600, 1,1,100.00|1489248000,
> 1,1,100.00|1489334400, 1,1,100.00|1489420800, 1,1,100.00|1489507200,
> 1,1,100.00|1489593600, 1,1,100.00|148968, 1,1,100.00|1489766400)
>
> scala> listForRule77.length
> res77: Int = 29
>
>   I need to create an rdd containing 29 records for every userid in rdd1,
> according to listForRule77; each record starts with the userid, for example
> 1(the userid),1,1,100.00|1483286400.
>   My idea is like below:
> 1. write a udf to add the userid to the beginning of every string element
> of listForRule77.
> 2. use val rdd2 = rdd1.map{x => List_udf(x)}.flatMap(); the result rdd2
> may be what I need.
>
>   My question: Are there any problems with my idea? Is there a better
> way to do this?
>
>
>
> 
>
> Thanks
> Best regards!
> San.Luo
>