To use HiveThriftServer2.startWithContext, I thought one would use the following artifact in the build:
"org.apache.spark" %% "spark-hive-thriftserver" % "1.3.0" But I am unable to resolve the artifact. I do not see it in maven central or any other repo. Do I need to build Spark and publish locally or just missing something obvious here? Basic class is like this: import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.hive.HiveContext import org.apache.spark.sql.hive.HiveMetastoreTypes._ import org.apache.spark.sql.types._ import org.apache.spark.sql.hive.thriftserver._ object MyThriftServer { val sparkConf = new SparkConf() // master is passed to spark-submit, but could also be specified explicitely // .setMaster(sparkMaster) .setAppName("My ThriftServer") .set("spark.cores.max", "2") val sc = new SparkContext(sparkConf) val sparkContext = sc import sparkContext._ val sqlContext = new HiveContext(sparkContext) import sqlContext._ import sqlContext.implicits._ // register temp tables here HiveThriftServer2.startWithContext(sqlContext) } Build has the following: scalaVersion := "2.10.4" val SPARK_VERSION = "1.3.0" libraryDependencies ++= Seq( "org.apache.spark" %% "spark-streaming-kafka" % SPARK_VERSION exclude("org.apache.spark", "spark-core_2.10") exclude("org.apache.spark", "spark-streaming_2.10") exclude("org.apache.spark", "spark-sql_2.10") exclude("javax.jms", "jms"), "org.apache.spark" %% "spark-core" % SPARK_VERSION % "provided", "org.apache.spark" %% "spark-streaming" % SPARK_VERSION % "provided", "org.apache.spark" %% "spark-sql" % SPARK_VERSION % "provided", "org.apache.spark" %% "spark-hive" % SPARK_VERSION % "provided", "org.apache.spark" %% "spark-hive-thriftserver" % SPARK_VERSION % "provided", "org.apache.kafka" %% "kafka" % "0.8.1.1" exclude("javax.jms", "jms") exclude("com.sun.jdmk", "jmxtools") exclude("com.sun.jmx", "jmxri"), "joda-time" % "joda-time" % "2.7", "log4j" % "log4j" % "1.2.14" exclude("com.sun.jdmk", "jmxtools") exclude("com.sun.jmx", "jmxri") ) Appreciate the assistance. -Todd On Tue, Apr 7, 2015 at 4:09 PM, James Aley <james.a...@swiftkey.com> wrote: > Excellent, thanks for your help, I appreciate your advice! > On 7 Apr 2015 20:43, "Michael Armbrust" <mich...@databricks.com> wrote: > >> That should totally work. The other option would be to run a persistent >> metastore that multiple contexts can talk to and periodically run a job >> that creates missing tables. The trade-off here would be more complexity, >> but less downtime due to the server restarting. >> >> On Tue, Apr 7, 2015 at 12:34 PM, James Aley <james.a...@swiftkey.com> >> wrote: >> >>> Hi Michael, >>> >>> Thanks so much for the reply - that really cleared a lot of things up >>> for me! >>> >>> Let me just check that I've interpreted one of your suggestions for (4) >>> correctly... Would it make sense for me to write a small wrapper app that >>> pulls in hive-thriftserver as a dependency, iterates my Parquet >>> directory structure to discover "tables" and registers each as a temp table >>> in some context, before calling HiveThriftServer2.createWithContext as >>> you suggest? >>> >>> This would mean that to add new content, all I need to is restart that >>> app, which presumably could also be avoided fairly trivially by >>> periodically restarting the server with a new context internally. That >>> certainly beats manual curation of Hive table definitions, if it will work? >>> >>> >>> Thanks again, >>> >>> James. 
>>>
>>> On 7 April 2015 at 19:30, Michael Armbrust <mich...@databricks.com>
>>> wrote:
>>>
>>>>> 1) What exactly is the relationship between the thrift server and
>>>>> Hive? I'm guessing Spark is just making use of the Hive metastore to
>>>>> access table definitions, and maybe some other things, is that the
>>>>> case?
>>>>
>>>> Underneath the covers, the Spark SQL thrift server executes queries
>>>> using a HiveContext. In this mode, nearly all computation is done with
>>>> Spark SQL, but we try to maintain compatibility with Hive wherever
>>>> possible. This means that you can write your queries in HiveQL, read
>>>> tables from the Hive metastore, and use Hive UDFs, UDAFs, UDTFs, etc.
>>>>
>>>> The one exception here is Hive DDL operations (CREATE TABLE, etc.).
>>>> These are passed directly to Hive code and executed there. The Spark
>>>> SQL DDL is sufficiently different that we always try to parse that
>>>> first, and fall back to Hive when it does not parse.
>>>>
>>>> One possibly confusing point here is that you can persist Spark SQL
>>>> tables into the Hive metastore, but such a table is not the same as a
>>>> Hive table. We only use the metastore as a repository for metadata; we
>>>> do not use the Hive format for the information in this case (as we
>>>> have data sources that Hive does not understand, including things like
>>>> schema auto-discovery).
>>>>
>>>> HiveQL DDL, run by Hive, but readable by Spark SQL:
>>>>
>>>>   CREATE TABLE t (x INT) STORED AS PARQUET
>>>>
>>>> Spark SQL DDL, run by Spark SQL and stored in the metastore, but not
>>>> readable by Hive:
>>>>
>>>>   CREATE TABLE t USING parquet OPTIONS (path '/path/to/data')
>>>>
>>>>> 2) Am I therefore right in thinking that SQL queries sent to the
>>>>> thrift server are still executed on the Spark cluster, using Spark
>>>>> SQL, and Hive plays no active part in computation of results?
>>>>
>>>> Correct.
>>>>
>>>>> 3) What SQL flavour is actually supported by the Thrift Server? Is it
>>>>> Spark SQL, Hive, or both? I'm confused, because I've seen it
>>>>> accepting Hive CREATE TABLE syntax, but Spark SQL seems to work too?
>>>>
>>>> HiveQL++ (that is, HiveQL plus the Spark SQL DDL). You can make it use
>>>> our simple SQL parser by `SET spark.sql.dialect=sql`, but honestly you
>>>> probably don't want to do this. The included SQL parser is mostly
>>>> there for people who have dependency conflicts with Hive.
>>>>
>>>>> 4) When I run SQL queries using the Scala or Python shells, Spark
>>>>> seems to figure out the schema by itself from my Parquet files very
>>>>> well, if I use registerTempTable on the DataFrame. It seems when
>>>>> running the thrift server, I need to create a Hive table definition
>>>>> first? Is that the case, or did I miss something? If it is, is there
>>>>> some sensible way to automate this?
>>>>
>>>> Temporary tables are only visible to the SQLContext that creates them.
>>>> If you want a table to be visible to the server, you need to either
>>>> start the thrift server with the same context your program is using
>>>> (see HiveThriftServer2.startWithContext) or make a metastore table.
>>>> The latter can be done using the Spark SQL DDL:
>>>>
>>>>   CREATE TABLE t USING parquet OPTIONS (path '/path/to/data')
>>>>
>>>> Michael
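One more thought: the metastore-table route Michael describes at the end
looks easy to drive from my skeleton above. I'd guess it comes down to
something like this (untested on my end; the table name and path are
placeholders):

// Persist a data source table definition in the Hive metastore, so that
// a separately started thrift server can see it too. Only metadata goes
// into the metastore; the data stays in the Parquet files at 'path'.
sqlContext.sql(
  "CREATE TABLE events USING parquet OPTIONS (path '/path/to/data')")

Unlike a temp table registered on the context, that definition should
outlive the app that created it, if I'm reading Michael's explanation
correctly.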