I doubt it will work as expected. Note that hiveContext.hql("select ...").registerAsTable("a") creates a SchemaRDD and then registers it with the (Hive) catalog, while sqlContext.jsonFile("xxx").registerAsTable("b") creates a SchemaRDD and then registers it with the Spark SQL catalog (SimpleCatalog). The logical plans of the two SchemaRDDs are of the same type, but the physical plans are, and should be, different.

The issue is that the transformation of logical plans into physical plans is controlled by the "strategies" of the contexts: sqlContext transforms a logical plan into a physical plan suitable for executing the SchemaRDD against an in-memory data source, while HiveContext transforms a logical plan into a physical plan suitable for executing the SchemaRDD against a Hive data source. So
sqlContext.sql("... a JOIN b ...") will generate a physical plan for the in-memory data source for both a and b, and hiveContext.sql("... a JOIN b ...") will generate a physical plan for the Hive data source for both a and b.

What's really needed, if Spark SQL wants to go the data federation route, is storage transparency at the semantic layer. If one could manage to create a SchemaRDD on Hive data through just the SQLContext, not the HiveContext (a subclass of SQLContext), as seemingly hinted by the Spark SQL web page https://spark.apache.org/sql/ in the following snippet:

    sqlCtx.jsonFile("s3n://...").registerAsTable("json")
    schema_rdd = sqlCtx.sql("""
      SELECT * FROM hiveTable JOIN json ...""")

then he/she might be able to join data sets of different types. I just have not tried it.

In terms of SQL-92 conformance, Presto might be better than HiveQL; in terms of federation, though, Hive is actually very good at it.

-----Original Message-----
From: chutium [mailto:teng....@gmail.com]
Sent: Thursday, August 21, 2014 4:35 AM
To: d...@spark.incubator.apache.org
Subject: Re: Spark SQL Query and join different data sources.

As far as I know, HQL queries try to find the schema info of all the tables in the query from the Hive metastore, so it is not possible to join tables from sqlContext using hiveContext.hql. But this should work:

    hiveContext.hql("select ...").registerAsTable("a")
    sqlContext.jsonFile("xxx").registerAsTable("b")

and then:

    sqlContext.sql("... a JOIN b ...")

I created a ticket, SPARK-2710, to add ResultSets from JDBC connections as a new data source, but there is no predicate push-down yet, and it is not available for HQL.

So, if you are looking for something that can query different data sources with full SQL-92 syntax, Facebook Presto is still the only choice; they have some kind of JDBC connector in development, and there are some unofficial implementations...
But I am looking forward to seeing the progress of Spark SQL. After SPARK-2179, SQLContext can handle any kind of structured data with a sequence of DataTypes as a schema, although turning the data into Rows is still a little bit tricky...

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-Query-and-join-different-data-sources-tp7914p7937.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
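[Editor's note] The catalog separation discussed in the reply above can be sketched with a toy model. This is plain Python, not actual Spark code; the class and method names (SimpleCatalog with register/lookup) are illustrative, not Spark APIs. The point is only that each context owns its own name-to-plan catalog, so a table registered through hiveContext cannot be resolved by sqlContext:

```python
# Toy model of per-context catalogs (illustrative, not Spark APIs).
class SimpleCatalog:
    """Maps table names to logical plans for one context."""
    def __init__(self):
        self.tables = {}

    def register(self, name, plan):
        self.tables[name] = plan

    def lookup(self, name):
        # Raises KeyError when the table was registered with another context.
        if name not in self.tables:
            raise KeyError("table not found: " + name)
        return self.tables[name]

hive_catalog = SimpleCatalog()   # owned by hiveContext
sql_catalog = SimpleCatalog()    # owned by sqlContext

hive_catalog.register("a", "logical plan over Hive data")
sql_catalog.register("b", "logical plan over JSON data")

# "b" resolves in sqlContext's catalog, but "a" does not:
try:
    sql_catalog.lookup("a")
    found = True
except KeyError:
    found = False
print(found)  # False
```

This is why hiveContext.hql(...).registerAsTable(...) followed by a join issued through the other context fails to resolve the table name.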
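[Editor's note] The "sequence of DataTypes as a schema" idea mentioned for SPARK-2179 can be sketched in plain Python as well. This is not the Spark API; the helper name rows_from and the sample schema are hypothetical, and the sketch only illustrates why turning raw data into typed Rows is the tricky part:

```python
# Hypothetical sketch: schema as (field name, type) pairs, records to Rows.
from collections import namedtuple

def rows_from(schema, records):
    """schema: list of (field_name, python_type); records: iterable of tuples."""
    Row = namedtuple("Row", [name for name, _ in schema])
    rows = []
    for rec in records:
        # Coercing each field to its declared type is the tricky part:
        # raw data (e.g. parsed JSON or JDBC strings) rarely arrives typed.
        rows.append(Row(*(typ(val) for (_, typ), val in zip(schema, rec))))
    return rows

schema = [("name", str), ("age", int)]
rows = rows_from(schema, [("alice", "30"), ("bob", "25")])
print(rows[0].age + rows[1].age)  # 55
```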