Hi Michael,

Thanks so much for the reply - that really cleared a lot of things up for
me!

Let me just check that I've interpreted one of your suggestions for (4)
correctly... Would it make sense for me to write a small wrapper app that
pulls in hive-thriftserver as a dependency, iterates my Parquet directory
structure to discover "tables" and registers each as a temp table in some
context, before calling HiveThriftServer2.startWithContext as you suggest?

This would mean that to add new content, all I need to do is restart that
app, and presumably even that could be avoided fairly trivially by
periodically restarting the server internally with a new context. That
would certainly beat manual curation of Hive table definitions, if it works.
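
To make that concrete, here is a rough, untested sketch of what I have in
mind. It assumes Spark 1.3, a root directory on the local filesystem whose
immediate subdirectories each hold one Parquet table, and directory names
that are valid table names. The object name and argument handling are just
placeholders of mine. It calls HiveThriftServer2.startWithContext, which as
far as I can tell is the name of the method in the 1.3 API.

```scala
import java.io.File

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// Hypothetical wrapper app: the name and layout conventions are assumptions.
object ParquetThriftServer {

  // Assumption: every immediate subdirectory of `root` is one table,
  // and the directory name is usable as the table name.
  def discoverTables(root: File): Seq[(String, String)] =
    Option(root.listFiles()).getOrElse(Array.empty[File])
      .filter(_.isDirectory)
      .map(dir => (dir.getName, dir.getAbsolutePath))
      .toSeq

  def main(args: Array[String]): Unit = {
    val root = new File(args(0)) // root of the Parquet directory structure

    val sc = new SparkContext(new SparkConf().setAppName("parquet-thrift-server"))
    val hiveContext = new HiveContext(sc)

    // Let Spark SQL infer each schema from the Parquet files, then
    // register the result as a temp table in this context.
    for ((name, path) <- discoverTables(root)) {
      hiveContext.parquetFile(path).registerTempTable(name)
    }

    // Serve the same context over JDBC/ODBC so the temp tables are visible.
    HiveThriftServer2.startWithContext(hiveContext)
  }
}
```

If the data lives on HDFS rather than a local path, I assume the listing
would have to go through the Hadoop FileSystem API instead of java.io.File.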


Thanks again,

James.

On 7 April 2015 at 19:30, Michael Armbrust <mich...@databricks.com> wrote:

>> 1) What exactly is the relationship between the thrift server and Hive?
>> I'm guessing Spark is just making use of the Hive metastore to access table
>> definitions, and maybe some other things, is that the case?
>>
>
> Underneath the covers, the Spark SQL thrift server is executing queries
> using a HiveContext.  In this mode, nearly all computation is done with
> Spark SQL but we try to maintain compatibility with Hive wherever
> possible.  This means that you can write your queries in HiveQL, read
> tables from the Hive metastore, and use Hive UDFs, UDTs, UDAFs, etc.
>
> The one exception here is Hive DDL operations (CREATE TABLE, etc).  These
> are passed directly to Hive code and executed there.  The Spark SQL DDL is
> sufficiently different that we always try to parse that first, and fall
> back to Hive when it does not parse.
>
> One possibly confusing point here is that you can persist Spark SQL
> tables into the Hive metastore, but this is not the same as a Hive table.
> We only use the metastore as a repo for metadata, and are not using
> Hive's format for the information in this case (as we have data sources
> that Hive does not understand, including things like schema auto discovery).
>
> HiveQL DDL, run by Hive but can be read by Spark SQL: CREATE TABLE t (x
> INT) STORED AS PARQUET
> Spark SQL DDL, run by Spark SQL, stored in metastore, cannot be read by
> Hive: CREATE TABLE t USING parquet OPTIONS (path '/path/to/data')
>
>
>> 2) Am I therefore right in thinking that SQL queries sent to the thrift
>> server are still executed on the Spark cluster, using Spark SQL, and Hive
>> plays no active part in computation of results?
>>
>
> Correct.
>
>> 3) What SQL flavour is actually supported by the Thrift Server? Is it
>> Spark SQL, Hive, or both? I'm confused, because I've seen it accepting
>> Hive CREATE TABLE syntax, but Spark SQL seems to work too?
>>
>
> HiveQL++ (with Spark SQL DDL).  You can make it use our simple SQL parser
> with `SET spark.sql.dialect=sql`, but honestly you probably don't want to do
> this.  The included SQL parser is mostly there for people who have
> dependency conflicts with Hive.
>
>
>> 4) When I run SQL queries using the Scala or Python shells, Spark seems
>> to figure out the schema by itself from my Parquet files very well, if I
>> use registerTempTable on the DataFrame. It seems when running the thrift
>> server, I need to create a Hive table definition first? Is that the case,
>> or did I miss something? If it is, is there some sensible way to automate
>> this?
>>
>
> Temporary tables are only visible to the SQLContext that creates them.  If
> you want them to be visible to the server, you need to either start the
> thrift server with the same context your program is using
> (see HiveThriftServer2.startWithContext) or make a metastore table.  This
> can be done using Spark SQL DDL:
>
> CREATE TABLE t USING parquet OPTIONS (path '/path/to/data')
>
> Michael
>
