Hi,

We have two Hive tables that we want to join.

Initially, we ran a SQL query through HiveContext, but it did not complete: it
stalled at 30/600 tasks.
Then we tried loading the tables into two DataFrames and encountered the
same problem.
Finally, it works with RDD.join: we transformed the two tables into two pair
RDDs and called a join on them. That completes in about 500 s.

However, a workaround is just a workaround: having to convert the Hive tables
into RDDs is really annoying.

We are just wondering whether the underlying implementation of the DF/SQL join
is the same as the RDD one, given that DF/SQL queries go through an
analysis/optimization layer, while an RDD join operates directly on two pair
RDDs.

SQL request:
----------------------------------------------------------------------
select v1.receipt_id, v1.sku, v1.amount, v1.qty, v2.discount
from table1 as v1 left join table2 as v2
on v1.receipt_id = v2.receipt_id
where v1.sku != ""

DataFrame:
-----------------------------------------------------------------------------------------
val df1 = ss.hiveContext.table(table1)
val df1Filt = df1.filter(df1.col("sku") !== "")
val df2 = ss.hiveContext.table(table2)
// "left_outer" matches the LEFT JOIN in the SQL version; the two-argument
// join defaults to an inner join.
val dfJoin = df1Filt.join(df2, df1Filt("receipt_id") === df2("receipt_id"), "left_outer")
dfJoin.saveAsTable("testJoinTable", SaveMode.Overwrite)

The RDD workaround is a bit cumbersome; in short, we created two pair RDDs,
joined them, and then applied a new schema to the resulting RDD. This approach
works (all tasks finished), while the DF/SQL approaches do not.
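For reference, the pair-RDD semantics the workaround relies on can be sketched in plain Scala, with no Spark dependency. The case classes and field names below are hypothetical stand-ins for our two table schemas; Spark's PairRDDFunctions.leftOuterJoin behaves analogously on (key, value) pairs, except that it emits one output row per matching pair, whereas this sketch assumes at most one discount row per receipt_id:

```scala
// Plain-Scala sketch of the pair-RDD left outer join used as a workaround.
// `Line` and `Disc` are hypothetical stand-ins for the two Hive table rows.
object JoinSketch {
  case class Line(receiptId: String, sku: String, amount: Double, qty: Int)
  case class Disc(receiptId: String, discount: Double)

  // Key the right side by receipt_id, then left-outer join, mirroring:
  //   lines.map(l => (l.receiptId, l))
  //        .leftOuterJoin(discs.map(d => (d.receiptId, d.discount)))
  // NOTE: the Map assumes at most one discount per receipt_id; Spark's
  // leftOuterJoin would emit one row per matching pair.
  def leftJoin(lines: Seq[Line], discs: Seq[Disc]): Seq[(Line, Option[Double])] = {
    val right: Map[String, Double] = discs.map(d => d.receiptId -> d.discount).toMap
    lines
      .filter(_.sku.nonEmpty)                 // same predicate as: where v1.sku != ""
      .map(l => (l, right.get(l.receiptId)))  // None when no matching discount row
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq(Line("r1", "sku1", 10.0, 1), Line("r2", "", 5.0, 2))
    val discs = Seq(Disc("r1", 0.5))
    leftJoin(lines, discs).foreach(println)
  }
}
```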

Any ideas?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/The-differentce-between-SparkSql-DataFram-join-and-Rdd-join-tp22407.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
