Hi, We have 2 hive tables and want to join one with the other.
Initially, we ran a sql request on HiveContext. But it did not work. It was blocked on 30/600 tasks. Then we tried to load tables into two DataFrames, we have encountered the same problem. Finally, it works with RDD.join. What we have done is basically transforming 2 tables into 2 pair RDDs, then calling a join operation. It works great in about 500 s. However, workaround is just a workaround, since we have to transform hive tables into RDD. This is really annoying. Just wondering whether the underlying code of DF/SQL's join operation is the same as rdd's, knowing that there is a syntax analysis layer for DF/SQL, while RDD's join is straightforward on two pair RDDs. SQL request: ---------------------------------------------------------------------- select v1.receipt_id, v1.sku, v1.amount, v1.qty, v2.discount from table1 as v1 left join table2 as v2 on v1.receipt_id = v2.receipt_id where v1.sku != "" DataFrame: ----------------------------------------------------------------------------------------- val rdd1 = ss.hiveContext.table(table1) val rdd1Filt = rdd1.filter(rdd1.col("sku") !== "") val rdd2 = ss.hiveContext.table(table2) val rddJoin = rdd1Filt.join(rdd2, rdd1Filt("receipt_id") === rdd2("receipt_id")) rddJoin.saveAsTable("testJoinTable", SaveMode.Overwrite) RDD workaround in this case is a bit cumbersome, for short, we just created 2 RDDs, join, and then apply a new schema on the result RDD. This approach works, at least all tasks were finished, while the DF/SQL approach don't. Any idea ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/The-differentce-between-SparkSql-DataFram-join-and-Rdd-join-tp22407.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org