Hi Michael,

Thanks so much for the reply - that really cleared a lot of things up for me!
Let me just check that I've interpreted one of your suggestions for (4) correctly... Would it make sense for me to write a small wrapper app that pulls in hive-thriftserver as a dependency, iterates my Parquet directory structure to discover "tables" and registers each as a temp table in some context, before calling HiveThriftServer2.startWithContext as you suggest? (Rough sketch of what I mean at the bottom of this mail.) This would mean that to add new content, all I need to do is restart that app, which presumably could also be avoided fairly trivially by periodically restarting the server with a new context internally. That certainly beats manual curation of Hive table definitions, if it will work?

Thanks again,
James.

On 7 April 2015 at 19:30, Michael Armbrust <mich...@databricks.com> wrote:

>> 1) What exactly is the relationship between the thrift server and Hive?
>> I'm guessing Spark is just making use of the Hive metastore to access
>> table definitions, and maybe some other things, is that the case?
>
> Underneath the covers, the Spark SQL thrift server is executing queries
> using a HiveContext. In this mode, nearly all computation is done with
> Spark SQL, but we try to maintain compatibility with Hive wherever
> possible. This means that you can write your queries in HiveQL, read
> tables from the Hive metastore, and use Hive UDFs, UDTFs, UDAFs, etc.
>
> The one exception here is Hive DDL operations (CREATE TABLE, etc.). These
> are passed directly to Hive code and executed there. The Spark SQL DDL is
> sufficiently different that we always try to parse that first, and fall
> back to Hive when it does not parse.
>
> One possibly confusing point here is that you can persist Spark SQL
> tables into the Hive metastore, but this is not the same as a Hive table.
> We only use the metastore as a repository for metadata, and are not using
> Hive's format for the information in this case (as we have data sources
> that Hive does not understand, including things like schema
> auto-discovery).
>
> HiveQL DDL, run by Hive but can be read by Spark SQL:
>   CREATE TABLE t (x INT) STORED AS PARQUET
> Spark SQL DDL, run by Spark SQL, stored in the metastore, cannot be read
> by Hive:
>   CREATE TABLE t USING parquet (path '/path/to/data')
>
>> 2) Am I therefore right in thinking that SQL queries sent to the thrift
>> server are still executed on the Spark cluster, using Spark SQL, and
>> Hive plays no active part in computation of results?
>
> Correct.
>
>> 3) What SQL flavour is actually supported by the Thrift Server? Is it
>> Spark SQL, Hive, or both? I'm confused, because I've seen it accepting
>> Hive CREATE TABLE syntax, but Spark SQL seems to work too?
>
> HiveQL++ (with Spark SQL DDL). You can make it use our simple SQL parser
> by running `SET spark.sql.dialect=sql`, but honestly you probably don't
> want to do this. The included SQL parser is mostly there for people who
> have dependency conflicts with Hive.
>
>> 4) When I run SQL queries using the Scala or Python shells, Spark seems
>> to figure out the schema by itself from my Parquet files very well, if I
>> use registerTempTable on the DataFrame. It seems when running the thrift
>> server, I need to create a Hive table definition first? Is that the
>> case, or did I miss something? If it is, is there some sensible way to
>> automate this?
>
> Temporary tables are only visible to the SQLContext that creates them.
> If you want it to be visible to the server, you need to either start the
> thrift server with the same context your program is using (see
> HiveThriftServer2.startWithContext) or make a metastore table. This can
> be done using Spark SQL DDL:
>
>   CREATE TABLE t USING parquet (path '/path/to/data')
>
> Michael
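
For reference, the kind of wrapper I have in mind is below. It's just an untested sketch against the Spark 1.3 APIs; the app name and the "one Parquet sub-directory per table" layout are my own assumptions about my data, not anything from your mail:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    object ParquetThriftServer {
      def main(args: Array[String]): Unit = {
        // Root directory containing one Parquet directory per "table".
        val root = new Path(args(0))

        val sc = new SparkContext(new SparkConf().setAppName("parquet-thrift-server"))
        val hiveContext = new HiveContext(sc)

        // Register each immediate sub-directory as a temp table named after the directory.
        val fs = FileSystem.get(sc.hadoopConfiguration)
        fs.listStatus(root).filter(_.isDirectory).foreach { dir =>
          hiveContext.parquetFile(dir.getPath.toString)
            .registerTempTable(dir.getPath.getName)
        }

        // Expose the same context (and therefore the temp tables) over JDBC/ODBC.
        HiveThriftServer2.startWithContext(hiveContext)
      }
    }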
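
And if I go down the metastore-table route instead, I assume the per-directory step inside that loop would just become a Spark SQL DDL statement issued through the same context, something like the following (table name and path are placeholders, and I think the path needs to go in an OPTIONS clause):

    // Persist a data source table definition in the metastore instead of a temp table.
    hiveContext.sql(
      s"""CREATE TABLE ${dir.getPath.getName}
         |USING parquet
         |OPTIONS (path '${dir.getPath}')""".stripMargin)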