I doubt it will work as expected. Note that hiveContext.hql("select ...").registerAsTable("a") creates a SchemaRDD and then registers it with the (Hive) catalog, while sqlContext.jsonFile("xxx").registerAsTable("b") creates a SchemaRDD and then registers it with the Spark SQL catalog (SimpleCatalog). The logical plans of the two SchemaRDDs are of the same type, but the physical plans are, and should be, different.

The issue is that the transformation of logical plans into physical plans is controlled by the "strategies" of the contexts: sqlContext transforms a logical plan into a physical plan suitable for executing the SchemaRDD against an in-memory data source, while HiveContext transforms a logical plan into a physical plan suitable for executing the SchemaRDD against a Hive data source. So
sqlContext.sql("... a JOIN b ...") will generate a physical plan for the in-memory data source for both a and b, and hiveContext.sql("... a JOIN b ...") will generate a physical plan for the Hive data source for both a and b.

What's really needed, if Spark SQL wants to go the data federation route, is storage transparency at the semantic layer. If one could manage to create a SchemaRDD on Hive data through just the SQLContext, not the HiveContext (a subclass of SQLContext), as seemingly hinted by the Spark SQL web page https://spark.apache.org/sql/ in the following snippet:

    sqlCtx.jsonFile("s3n://...").registerAsTable("json")
    schema_rdd = sqlCtx.sql("""
      SELECT * FROM hiveTable JOIN json ...""")

then he/she might be able to join data sets of different types. I just have not tried it.

In terms of SQL-92 conformance, Presto might be better than HiveQL; in terms of federation, though, Hive is actually very good at it.

-----Original Message-----
From: chutium [mailto:teng....@gmail.com]
Sent: Thursday, August 21, 2014 4:35 AM
To: d...@spark.incubator.apache.org
Subject: Re: Spark SQL Query and join different data sources.

As far as I know, HQL queries try to find the schema info of all the tables in the query from the Hive metastore, so it is not possible to join tables from sqlContext using hiveContext.hql. But this should work:

    hiveContext.hql("select ...").registerAsTable("a")
    sqlContext.jsonFile("xxx").registerAsTable("b")

and then:

    sqlContext.sql("... a JOIN b ...")

I created a ticket, SPARK-2710, to add ResultSets from JDBC connections as a new data source, but there is no predicate push-down yet, and it is not available for HQL.

So, if you are looking for something that can query different data sources with full SQL-92 syntax, Facebook Presto is still the only choice; they have some kind of JDBC connector in development, and there are some unofficial implementations...
But I am looking forward to seeing the progress of Spark SQL. After SPARK-2179, SQLContext can handle any kind of structured data with a sequence of DataTypes as a schema, although turning the data into Rows is still a little bit tricky...

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-Query-and-join-different-data-sources-tp7914p7937.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
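[Editor's note] The catalog separation discussed in the reply above can be sketched with a toy model. This is plain Python, not actual Spark code; the class and method names (SimpleCatalog with register/lookup) are illustrative, not Spark APIs. The point is only that each context owns its own name-to-plan catalog, so a table registered through hiveContext cannot be resolved by sqlContext:

```python
# Toy model of per-context catalogs (illustrative, not Spark APIs).
class SimpleCatalog:
    """Maps table names to logical plans for one context."""
    def __init__(self):
        self.tables = {}

    def register(self, name, plan):
        self.tables[name] = plan

    def lookup(self, name):
        # Raises KeyError when the table was registered with another context.
        if name not in self.tables:
            raise KeyError("table not found: " + name)
        return self.tables[name]

hive_catalog = SimpleCatalog()   # owned by hiveContext
sql_catalog = SimpleCatalog()    # owned by sqlContext

hive_catalog.register("a", "logical plan over Hive data")
sql_catalog.register("b", "logical plan over JSON data")

# "b" resolves in sqlContext's catalog, but "a" does not:
try:
    sql_catalog.lookup("a")
    found = True
except KeyError:
    found = False
print(found)  # False
```

This is why hiveContext.hql(...).registerAsTable(...) followed by a join issued through the other context fails to resolve the table name.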
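[Editor's note] The "sequence of DataTypes as a schema" idea mentioned for SPARK-2179 can be sketched in plain Python as well. This is not the Spark API; the helper name rows_from and the sample schema are hypothetical, and the sketch only illustrates why turning raw data into typed Rows is the tricky part:

```python
# Hypothetical sketch: schema as (field name, type) pairs, records to Rows.
from collections import namedtuple

def rows_from(schema, records):
    """schema: list of (field_name, python_type); records: iterable of tuples."""
    Row = namedtuple("Row", [name for name, _ in schema])
    rows = []
    for rec in records:
        # Coercing each field to its declared type is the tricky part:
        # raw data (e.g. parsed JSON or JDBC strings) rarely arrives typed.
        rows.append(Row(*(typ(val) for (_, typ), val in zip(schema, rec))))
    return rows

schema = [("name", str), ("age", int)]
rows = rows_from(schema, [("alice", "30"), ("bob", "25")])
print(rows[0].age + rows[1].age)  # 55
```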