Re: The difference between SparkSQL/DataFrame join and RDD join

2015-04-08 Thread Hao Ren
)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:236)
at akka.actor.ActorSystemImpl$TerminationCallbacks.ready(ActorSystem.scala:817)
at akka.actor.ActorSystemImpl$TerminationCallbacks.ready(ActorSystem.scala:786)
at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:86)
at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:86)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.ready(package.scala:86)
at akka.actor.ActorSystemImpl.awaitTermination(ActorSystem.scala:642)
at akka.actor.ActorSystemImpl.awaitTermination(ActorSystem.scala:643)
at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:531)
at org.apache.spark.deploy.worker.Worker.main(Worker.scala)

"VM Thread" prio=10 tid=0x7f149407d000 nid=0xe75 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x7f149401f000 nid=0xe6d runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x7f1494021000 nid=0xe6e runnable

"GC task thread#2 (ParallelGC)" prio=10 tid=0x7f1494022800 nid=0xe6f runnable

"GC task thread#3 (ParallelGC)" prio=10 tid=0x7f1494024800 nid=0xe70 runnable

"GC task thread#4 (ParallelGC)" prio=10 tid=0x7f1494026800 nid=0xe71 runnable

"GC task thread#5 (ParallelGC)" prio=10 tid=0x7f1494028000 nid=0xe72 runnable

"GC task thread#6 (ParallelGC)" prio=10 tid=0x7f149402a000 nid=0xe73 runnable

"GC task thread#7 (ParallelGC)" prio=10 tid=0x7f149402c000 nid=0xe74 runnable

"VM Periodic Task Thread" prio=10 tid=0x7f14940c2800 nid=0xe7c waiting on condition

JNI global references: 230


Let me know if anything else is needed.

Thank you.

Hao.


On Tue, Apr 7, 2015 at 8:00 PM, Michael Armbrust mich...@databricks.com
wrote:

 The joins here are totally different implementations, but it is worrisome
 that you are seeing the SQL join hanging. Can you provide more information
 about the hang? A jstack of the driver and of a worker that is processing a
 task would be very useful.

 On Tue, Apr 7, 2015 at 8:33 AM, Hao Ren inv...@gmail.com wrote:

 Hi,

 We have two Hive tables and want to join one with the other.

 Initially, we ran a SQL query on HiveContext, but it did not work: it got
 stuck at 30/600 tasks.
 Then we tried loading the tables into two DataFrames and ran into the same
 problem.
 Finally, it works with RDD.join. What we did was basically transform the
 two tables into two pair RDDs and then call a join operation. It works
 great, finishing in about 500 s.

 However, a workaround is just a workaround, since we have to transform the
 Hive tables into RDDs. This is really annoying.

 Just wondering whether the underlying code of the DF/SQL join is the same
 as the RDD join, given that there is a syntax analysis layer for DF/SQL,
 while the RDD join works directly on two pair RDDs.

 SQL query:
 --
 select v1.receipt_id, v1.sku, v1.amount, v1.qty, v2.discount
 from table1 as v1 left join table2 as v2
 on v1.receipt_id = v2.receipt_id
 where v1.sku != ''

 DataFrame:

 -
 val rdd1 = ss.hiveContext.table("table1")
 val rdd1Filt = rdd1.filter(rdd1.col("sku") !== "")
 val rdd2 = ss.hiveContext.table("table2")
 val rddJoin = rdd1Filt.join(rdd2, rdd1Filt("receipt_id") === rdd2("receipt_id"))
 rddJoin.saveAsTable("testJoinTable", SaveMode.Overwrite)

 The RDD workaround in this case is a bit cumbersome; in short, we just
 created two pair RDDs, joined them, and then applied a new schema to the
 result RDD. This approach works (at least all tasks finished), while the
 DF/SQL approach doesn't.

 Any ideas?









-- 
Hao Ren

{Data, Software} Engineer @ ClaraVista

Paris, France


Re: The difference between SparkSQL/DataFrame join and RDD join

2015-04-08 Thread Michael Armbrust
 rddJoin.saveAsTable("testJoinTable", SaveMode.Overwrite)

 The RDD workaround in this case is a bit cumbersome; in short, we just
 created two pair RDDs, joined them, and then applied a new schema to the
 result RDD. This approach works (at least all tasks finished), while the
 DF/SQL approach doesn't.

 Any ideas?









 --
 Hao Ren

 {Data, Software} Engineer @ ClaraVista

 Paris, France



The difference between SparkSQL/DataFrame join and RDD join

2015-04-07 Thread Hao Ren
Hi,

We have two Hive tables and want to join one with the other.

Initially, we ran a SQL query on HiveContext, but it did not work: it got
stuck at 30/600 tasks.
Then we tried loading the tables into two DataFrames and ran into the same
problem.
Finally, it works with RDD.join. What we did was basically transform the
two tables into two pair RDDs and then call a join operation. It works
great, finishing in about 500 s.

However, a workaround is just a workaround, since we have to transform the
Hive tables into RDDs. This is really annoying.

Just wondering whether the underlying code of the DF/SQL join is the same
as the RDD join, given that there is a syntax analysis layer for DF/SQL,
while the RDD join works directly on two pair RDDs.

SQL query:
--
select v1.receipt_id, v1.sku, v1.amount, v1.qty, v2.discount
from table1 as v1 left join table2 as v2
on v1.receipt_id = v2.receipt_id
where v1.sku != ''

DataFrame:
-
val rdd1 = ss.hiveContext.table("table1")
val rdd1Filt = rdd1.filter(rdd1.col("sku") !== "")
val rdd2 = ss.hiveContext.table("table2")
val rddJoin = rdd1Filt.join(rdd2, rdd1Filt("receipt_id") === rdd2("receipt_id"))
rddJoin.saveAsTable("testJoinTable", SaveMode.Overwrite)

The RDD workaround in this case is a bit cumbersome; in short, we just
created two pair RDDs, joined them, and then applied a new schema to the
result RDD (a rough sketch of the code is below). This approach works (at
least all tasks finished), while the DF/SQL approach doesn't.
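
To make that concrete, here is a minimal sketch of the kind of code the
workaround involves. It reuses our ss.hiveContext handle; the column
positions, field types, and the default discount value are illustrative
assumptions, not our real schema:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Turn each Hive table into a pair RDD keyed by receipt_id.
val t1 = ss.hiveContext.table("table1").rdd
  .filter(r => r.getString(1) != "")  // drop rows with an empty sku
  .map(r => (r.getString(0), (r.getString(1), r.getDouble(2), r.getLong(3))))
val t2 = ss.hiveContext.table("table2").rdd
  .map(r => (r.getString(0), r.getDouble(1)))

// Plain pair-RDD left outer join on receipt_id, mirroring the SQL left join.
val joinedRows = t1.leftOuterJoin(t2).map {
  case (receiptId, ((sku, amount, qty), discount)) =>
    Row(receiptId, sku, amount, qty, discount.getOrElse(0.0))
}

// Re-apply a schema to turn the result RDD back into a DataFrame.
val schema = StructType(Seq(
  StructField("receipt_id", StringType),
  StructField("sku", StringType),
  StructField("amount", DoubleType),
  StructField("qty", LongType),
  StructField("discount", DoubleType)))
val result = ss.hiveContext.createDataFrame(joinedRows, schema)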

Any ideas?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/The-differentce-between-SparkSql-DataFram-join-and-Rdd-join-tp22407.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: The difference between SparkSQL/DataFrame join and RDD join

2015-04-07 Thread Michael Armbrust
The joins here are totally different implementations, but it is worrisome
that you are seeing the SQL join hanging. Can you provide more information
about the hang? A jstack of the driver and of a worker that is processing a
task would be very useful.
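
(For example: find the JVM pids with jps -lm, then run something like
jstack <pid> > stack-dump.txt; both tools ship with the JDK. Two or three
dumps taken a few seconds apart make it easier to tell whether the threads
are genuinely stuck.)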

On Tue, Apr 7, 2015 at 8:33 AM, Hao Ren inv...@gmail.com wrote:

 Hi,

 We have two Hive tables and want to join one with the other.

 Initially, we ran a SQL query on HiveContext, but it did not work: it got
 stuck at 30/600 tasks.
 Then we tried loading the tables into two DataFrames and ran into the same
 problem.
 Finally, it works with RDD.join. What we did was basically transform the
 two tables into two pair RDDs and then call a join operation. It works
 great, finishing in about 500 s.

 However, a workaround is just a workaround, since we have to transform the
 Hive tables into RDDs. This is really annoying.

 Just wondering whether the underlying code of the DF/SQL join is the same
 as the RDD join, given that there is a syntax analysis layer for DF/SQL,
 while the RDD join works directly on two pair RDDs.

 SQL query:
 --
 select v1.receipt_id, v1.sku, v1.amount, v1.qty, v2.discount
 from table1 as v1 left join table2 as v2
 on v1.receipt_id = v2.receipt_id
 where v1.sku != ''

 DataFrame:

 -
 val rdd1 = ss.hiveContext.table("table1")
 val rdd1Filt = rdd1.filter(rdd1.col("sku") !== "")
 val rdd2 = ss.hiveContext.table("table2")
 val rddJoin = rdd1Filt.join(rdd2, rdd1Filt("receipt_id") === rdd2("receipt_id"))
 rddJoin.saveAsTable("testJoinTable", SaveMode.Overwrite)

 The RDD workaround in this case is a bit cumbersome; in short, we just
 created two pair RDDs, joined them, and then applied a new schema to the
 result RDD. This approach works (at least all tasks finished), while the
 DF/SQL approach doesn't.

 Any ideas?



