Re: The difference between SparkSQL/DataFrame join and RDD join
) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:236)
  at akka.actor.ActorSystemImpl$TerminationCallbacks.ready(ActorSystem.scala:817)
  at akka.actor.ActorSystemImpl$TerminationCallbacks.ready(ActorSystem.scala:786)
  at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:86)
  at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:86)
  at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
  at scala.concurrent.Await$.ready(package.scala:86)
  at akka.actor.ActorSystemImpl.awaitTermination(ActorSystem.scala:642)
  at akka.actor.ActorSystemImpl.awaitTermination(ActorSystem.scala:643)
  at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:531)
  at org.apache.spark.deploy.worker.Worker.main(Worker.scala)

"VM Thread" prio=10 tid=0x7f149407d000 nid=0xe75 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x7f149401f000 nid=0xe6d runnable
"GC task thread#1 (ParallelGC)" prio=10 tid=0x7f1494021000 nid=0xe6e runnable
"GC task thread#2 (ParallelGC)" prio=10 tid=0x7f1494022800 nid=0xe6f runnable
"GC task thread#3 (ParallelGC)" prio=10 tid=0x7f1494024800 nid=0xe70 runnable
"GC task thread#4 (ParallelGC)" prio=10 tid=0x7f1494026800 nid=0xe71 runnable
"GC task thread#5 (ParallelGC)" prio=10 tid=0x7f1494028000 nid=0xe72 runnable
"GC task thread#6 (ParallelGC)" prio=10 tid=0x7f149402a000 nid=0xe73 runnable
"GC task thread#7 (ParallelGC)" prio=10 tid=0x7f149402c000 nid=0xe74 runnable

"VM Periodic Task Thread" prio=10 tid=0x7f14940c2800 nid=0xe7c waiting on condition

JNI global references: 230

Tell me if anything else is needed. Thank you.

Hao

On Tue, Apr 7, 2015 at 8:00 PM, Michael Armbrust <mich...@databricks.com> wrote:
> The joins here are totally different implementations, but it is worrisome that you
> are seeing the SQL join hanging. Can you provide more information about the hang?
> jstack of the driver and a worker that is processing a task would be very useful.
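For other readers following along: dumps like the one above can be taken with the JDK's jps and jstack tools. A generic sketch (the output file names are illustrative; on a real cluster you would run this on the driver host and on a worker host, picking the relevant SparkSubmit / Worker / executor PIDs out of `jps -lm` instead of dumping every JVM):

```shell
# Dump the stack of every JVM visible to the current user
# (assumes the JDK's jps and jstack are on PATH).
for pid in $(jps -q 2>/dev/null); do
  jstack "$pid" > "jstack-$pid.txt" 2>/dev/null || true
done
echo "captured $(ls jstack-*.txt 2>/dev/null | wc -l) dump(s)"
```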
On Tue, Apr 7, 2015 at 8:33 AM, Hao Ren <inv...@gmail.com> wrote:
> [...]

--
Hao Ren
{Data, Software} Engineer @ ClaraVista
Paris, France
The difference between SparkSQL/DataFrame join and RDD join
Hi,

We have two Hive tables and want to join one with the other. Initially, we ran a SQL query via HiveContext, but it did not work: it was blocked at 30/600 tasks. Then we loaded the tables into two DataFrames and encountered the same problem. Finally, it works with RDD.join. What we did is basically transform the two tables into two pair RDDs and then call a join operation. It works well, in about 500 s.

However, a workaround is just a workaround, since we have to transform the Hive tables into RDDs, which is really annoying. I am just wondering whether the underlying code of the DF/SQL join operation is the same as the RDD one, knowing that there is a syntax-analysis layer for DF/SQL, while the RDD join is applied directly on two pair RDDs.

SQL request:
------------
select v1.receipt_id, v1.sku, v1.amount, v1.qty, v2.discount
from table1 as v1
left join table2 as v2 on v1.receipt_id = v2.receipt_id
where v1.sku != ''

DataFrame:
----------
val rdd1 = ss.hiveContext.table("table1")
val rdd1Filt = rdd1.filter(rdd1.col("sku") !== "")
val rdd2 = ss.hiveContext.table("table2")
val rddJoin = rdd1Filt.join(rdd2, rdd1Filt("receipt_id") === rdd2("receipt_id"))
rddJoin.saveAsTable("testJoinTable", SaveMode.Overwrite)

The RDD workaround in this case is a bit cumbersome. In short, we created two RDDs, joined them, and then applied a new schema to the result RDD. This approach works (at least, all tasks finished), while the DF/SQL approach doesn't.

Any idea?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/The-differentce-between-SparkSql-DataFram-join-and-Rdd-join-tp22407.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
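For readers unfamiliar with what the workaround relies on: RDD.join operates on pair RDDs and matches elements by key, much like a hash join. Here is a minimal, Spark-free sketch of those semantics in plain Python; the keys and field values are made up to mirror the receipt tables above, and this is an illustration of the idea, not Spark's actual implementation:

```python
from collections import defaultdict

def pair_join(left, right):
    """Inner join of two (key, value) sequences by key, like RDD.join:
    yields (key, (left_value, right_value)) for every matching pair."""
    by_key = defaultdict(list)
    for k, v in right:
        by_key[k].append(v)
    return [(k, (lv, rv)) for k, lv in left for rv in by_key.get(k, [])]

# Toy rows keyed by receipt_id, loosely mirroring table1 and table2 above.
table1 = [("r1", ("sku-a", 10.0, 2)), ("r2", ("sku-b", 5.0, 1))]
table2 = [("r1", 0.15)]

print(pair_join(table1, table2))
# [('r1', (('sku-a', 10.0, 2), 0.15))]
```

Note that Spark's pair-RDD API offers both join (an inner join, as sketched here) and leftOuterJoin, which is what actually corresponds to the LEFT JOIN in the SQL query; after joining, the workaround described in the post re-applies a schema to turn the result RDD back into a table.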
Re: The difference between SparkSQL/DataFrame join and RDD join
The joins here are totally different implementations, but it is worrisome that you are seeing the SQL join hanging. Can you provide more information about the hang? jstack of the driver and a worker that is processing a task would be very useful.

On Tue, Apr 7, 2015 at 8:33 AM, Hao Ren <inv...@gmail.com> wrote:
> [...]
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org