Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-09 Thread Todd Nist
Hi Mohammed,

Sorry, I guess I was not really clear in my response.  Yes, sbt fails; the
-DskipTests flag is for mvn, as I showed in the example of how I built it.

I do not believe that -DskipTests has any impact in sbt, but I could be
wrong.  sbt package should skip tests.  I did not try to track down where
the dependency was coming from.  Based on Patrick's comments, it sounds like
this is now resolved.

Sorry for the confusion.


RE: Advice using Spark SQL and Thrift JDBC Server

2015-04-08 Thread Mohammed Guller

Interestingly, I ran into exactly the same issue yesterday.  I couldn’t
find any documentation about which project to include as a dependency in
build.sbt to use HiveThriftServer2. I would appreciate help.



Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-08 Thread Todd Nist
To use HiveThriftServer2.startWithContext, I thought one would use the
following artifact in the build:

"org.apache.spark" %% "spark-hive-thriftserver" % "1.3.0"

But I am unable to resolve the artifact.  I do not see it in Maven Central
or any other repo.  Do I need to build Spark and publish locally, or am I just
missing something obvious here?

Basic class is like this:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveMetastoreTypes._
import org.apache.spark.sql.types._
import org.apache.spark.sql.hive.thriftserver._

object MyThriftServer {

  val sparkConf = new SparkConf()
    // master is passed to spark-submit, but could also be specified explicitly
    // .setMaster(sparkMaster)
    .setAppName("My ThriftServer")
    .set("spark.cores.max", "2")

  val sc = new SparkContext(sparkConf)
  val sparkContext = sc
  import sparkContext._

  val sqlContext = new HiveContext(sparkContext)
  import sqlContext._
  import sqlContext.implicits._

  // register temp tables here

  HiveThriftServer2.startWithContext(sqlContext)
}

Build has the following:

scalaVersion := "2.10.4"

val SPARK_VERSION = "1.3.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming-kafka" % SPARK_VERSION
    exclude("org.apache.spark", "spark-core_2.10")
    exclude("org.apache.spark", "spark-streaming_2.10")
    exclude("org.apache.spark", "spark-sql_2.10")
    exclude("javax.jms", "jms"),
  "org.apache.spark" %% "spark-core" % SPARK_VERSION % "provided",
  "org.apache.spark" %% "spark-streaming" % SPARK_VERSION % "provided",
  "org.apache.spark" %% "spark-sql" % SPARK_VERSION % "provided",
  "org.apache.spark" %% "spark-hive" % SPARK_VERSION % "provided",
  "org.apache.spark" %% "spark-hive-thriftserver" % SPARK_VERSION %
  "org.apache.kafka" %% "kafka" %
    exclude("javax.jms", "jms")
    exclude("com.sun.jdmk", "jmxtools")
    exclude("com.sun.jmx", "jmxri"),
  "joda-time" % "joda-time" % "2.7",
  "log4j" % "log4j" % "1.2.14"
    exclude("com.sun.jdmk", "jmxtools")
    exclude("com.sun.jmx", "jmxri")
)

Appreciate the assistance.


On Tue, Apr 7, 2015 at 4:09 PM, James Aley wrote:

 Excellent, thanks for your help, I appreciate your advice!
 On 7 Apr 2015 20:43, Michael Armbrust wrote:

 That should totally work.  The other option would be to run a persistent
 metastore that multiple contexts can talk to and periodically run a job
 that creates missing tables.  The trade-off here would be more complexity,
 but less downtime due to the server restarting.

 On Tue, Apr 7, 2015 at 12:34 PM, James Aley

 Hi Michael,

 Thanks so much for the reply - that really cleared a lot of things up
 for me!

 Let me just check that I've interpreted one of your suggestions for (4)
 correctly... Would it make sense for me to write a small wrapper app that
 pulls in hive-thriftserver as a dependency, iterates my Parquet
 directory structure to discover tables and registers each as a temp table
 in some context, before calling HiveThriftServer2.createWithContext as
 you suggest?

 This would mean that to add new content, all I need to do is restart that
 app, which presumably could also be avoided fairly trivially by
 periodically restarting the server with a new context internally. That
 certainly beats manual curation of Hive table definitions, if it will work?

 Thanks again,
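
The discovery step of the wrapper app described in the quoted message above could be sketched roughly as follows. This is only an illustration under stated assumptions: the data sits on a local filesystem with one directory per table, and the names TableDiscovery and discoverTables are hypothetical, not part of Spark. On HDFS the same idea would need the Hadoop FileSystem API instead of java.io.File.

```scala
import java.io.File

object TableDiscovery {
  // Hypothetical helper (not a Spark API): maps each immediate subdirectory of
  // `root` to a (tableName, absolutePath) pair. Characters that are not letters,
  // digits or '_' are replaced with '_' so the name can serve as a table name.
  // Directories starting with '.' or '_' (e.g. _temporary) are skipped.
  def discoverTables(root: File): Seq[(String, String)] = {
    // listFiles() returns null when `root` is not a readable directory
    val children = Option(root.listFiles()).getOrElse(Array.empty[File])
    children.toSeq
      .filter(_.isDirectory)
      .filterNot(d => d.getName.startsWith(".") || d.getName.startsWith("_"))
      .map(d => (d.getName.replaceAll("[^A-Za-z0-9_]", "_"), d.getAbsolutePath))
      .sortBy(_._1)
  }
}
```

With Spark 1.3, each (name, path) pair could then be registered with something like sqlContext.parquetFile(path).registerTempTable(name) before calling HiveThriftServer2.startWithContext(sqlContext).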



Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-08 Thread Michael Armbrust
Sorry guys.  I didn't realize that was not fixed yet.

You can publish locally in the meantime (sbt/sbt publishLocal).
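
For readers following along, a build.sbt fragment that picks up a locally published Spark might look like the sketch below. It assumes sbt 0.13 syntax; the "provided" scope is an assumption carried over from the other Spark dependencies in this thread, not something confirmed for this artifact.

```scala
// build.sbt fragment (sketch, sbt 0.13 syntax)
scalaVersion := "2.10.4"

// Only needed when Spark was installed with `mvn ... install`, which writes to
// the local Maven repository (~/.m2). Artifacts from `sbt/sbt publishLocal`
// go to the local Ivy repository, which sbt already checks by default.
resolvers += Resolver.mavenLocal

// "provided" is an assumption; drop it if the jar must be bundled with the app.
libraryDependencies +=
  "org.apache.spark" %% "spark-hive-thriftserver" % "1.3.0" % "provided"
```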


RE: Advice using Spark SQL and Thrift JDBC Server

2015-04-08 Thread Mohammed Guller
Thank you!

Looks like the sbt build is broken for 1.3. I downloaded the source code for 
1.3, but I get the following error a few minutes after I run “sbt/sbt 

[error] (network-shuffle/*:update) sbt.ResolveException: unresolved dependency: 
org.apache.spark#spark-network-common_2.10;1.3.0: configuration not public in 
org.apache.spark#spark-network-common_2.10;1.3.0: 'test'. It was required from 
org.apache.spark#spark-network-shuffle_2.10;1.3.0 test
[error] Total time: 106 s, completed Apr 8, 2015 12:33:45 PM



RE: Advice using Spark SQL and Thrift JDBC Server

2015-04-08 Thread Mohammed Guller
Hey Patrick, Michael and Todd,
Thank you for your help!

As you guys recommended, I did a local install and got my code to compile.

As an FYI, on my local machine the sbt build fails even if I add -DskipTests.
So I used mvn.


From: Patrick Wendell []
Sent: Wednesday, April 8, 2015 6:16 PM
To: Todd Nist
Cc: Mohammed Guller; Michael Armbrust; James Aley; user
Subject: Re: Advice using Spark SQL and Thrift JDBC Server

Hey Guys,

Someone submitted a patch for this just now. It's a very simple fix and we can 
merge it soon. However, it's just missed our timeline for Spark 1.3.1, so the 
upstream thing won't get fully published until 1.3.2. However, you can always 
just install locally and build against your local install.

- Patrick

On Wed, Apr 8, 2015 at 4:38 PM, Todd Nist wrote:
Hi Mohammed,

I think you just need to add -DskipTests to your build.  Here is how I built it:

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver 
-DskipTests clean package install

build/sbt does however fail even if only doing package which should skip tests.

I am able to build the MyThriftServer above now.

Thanks Michael for the assistance.



Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-08 Thread Todd Nist
Hi Mohammed,

I think you just need to add -DskipTests to your build.  Here is how I built
it:

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver
-DskipTests clean package install

build/sbt does however fail even if only doing package, which should skip
tests.

I am able to build the MyThriftServer above now.

Thanks Michael for the assistance.


On Wed, Apr 8, 2015 at 3:39 PM, Mohammed Guller


 Thank you!

 Looks like the sbt build is broken for 1.3. I downloaded the source code
 for 1.3, but I get the following error a few minutes after I run “sbt/sbt

 [error] (network-shuffle/*:update) sbt.ResolveException: unresolved
 dependency: org.apache.spark#spark-network-common_2.10;1.3.0: configuration
 not public in org.apache.spark#spark-network-common_2.10;1.3.0: 'test'. It
 was required from org.apache.spark#spark-network-shuffle_2.10;1.3.0 test

 [error] Total time: 106 s, completed Apr 8, 2015 12:33:45 PM


 *From:* Michael Armbrust []
 *Sent:* Wednesday, April 8, 2015 11:54 AM
 *To:* Mohammed Guller
 *Cc:* Todd Nist; James Aley; user; Patrick Wendell

 *Subject:* Re: Advice using Spark SQL and Thrift JDBC Server

 Sorry guys.  I didn't realize that was not fixed yet.

 You can publish locally in the mean time (sbt/sbt publishLocal).

 On Wed, Apr 8, 2015 at 8:29 AM, Mohammed Guller


 Interestingly, I ran into the exactly the same issue yesterday.  I
 couldn’t find any documentation about which project to include as a
 dependency in build.sbt to use HiveThriftServer2. Would appreciate help.


 *From:* Todd Nist []
 *Sent:* Wednesday, April 8, 2015 5:49 AM
 *To:* James Aley
 *Cc:* Michael Armbrust; user
 *Subject:* Re: Advice using Spark SQL and Thrift JDBC Server

 To use the HiveThriftServer2.startWithContext, I thought one would use the
  following artifact in the build:

 org.apache.spark%% spark-hive-thriftserver   % 1.3.0

 But I am unable to resolve the artifact.  I do not see it in maven central
 or any other repo.  Do I need to build Spark and publish locally or just
 missing something obvious here?

 Basic class is like this:

 import org.apache.spark.{SparkConf, SparkContext}

 import  org.apache.spark.sql.hive.HiveContext

 import org.apache.spark.sql.hive.HiveMetastoreTypes._

 import org.apache.spark.sql.types._

 import  org.apache.spark.sql.hive.thriftserver._

 object MyThriftServer {

   val sparkConf = new SparkConf()

 // master is passed to spark-submit, but could also be specified 

 // .setMaster(sparkMaster)

 .setAppName(My ThriftServer)

 .set(spark.cores.max, 2)

   val sc = new SparkContext(sparkConf)

   val  sparkContext  =  sc

   import  sparkContext._

   val  sqlContext  =  new  HiveContext(sparkContext)

   import  sqlContext._

   import sqlContext.implicits._

 // register temp tables here   HiveThriftServer2.startWithContext(sqlContext)


  Build has the following:

 scalaVersion := 2.10.4

 val SPARK_VERSION = 1.3.0

 libraryDependencies ++= Seq(

 org.apache.spark %% spark-streaming-kafka % SPARK_VERSION

   exclude(org.apache.spark, spark-core_2.10)

   exclude(org.apache.spark, spark-streaming_2.10)

   exclude(org.apache.spark, spark-sql_2.10)

   exclude(javax.jms, jms),

 org.apache.spark %% spark-core  % SPARK_VERSION %  provided,

 org.apache.spark %% spark-streaming % SPARK_VERSION %  provided,

 org.apache.spark  %% spark-sql  % SPARK_VERSION % provided,

 org.apache.spark  %% spark-hive % SPARK_VERSION % provided,

 org.apache.spark %% spark-hive-thriftserver  % SPARK_VERSION   %

 org.apache.kafka %% kafka %

   exclude(javax.jms, jms)

   exclude(com.sun.jdmk, jmxtools)

   exclude(com.sun.jmx, jmxri),

 joda-time % joda-time % 2.7,

 log4j % log4j % 1.2.14

   exclude(com.sun.jdmk, jmxtools)

   exclude(com.sun.jmx, jmxri)


 Appreciate the assistance.


 On Tue, Apr 7, 2015 at 4:09 PM, James Aley

 Excellent, thanks for your help, I appreciate your advice!

 On 7 Apr 2015 20:43, Michael Armbrust wrote:

 That should totally work.  The other option would be to run a persistent
 metastore that multiple contexts can talk to and periodically run a job
 that creates missing tables.  The trade-off here would be more complexity,
 but less downtime due to the server restarting.

 On Tue, Apr 7, 2015 at 12:34 PM, James Aley wrote:

 Hi Michael,

 Thanks so much for the reply - that really cleared a lot of things up for me.

 Let me just check that I've interpreted one of your suggestions for (4)
 correctly... Would it make sense for me to write

Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-07 Thread Michael Armbrust

 1) What exactly is the relationship between the thrift server and Hive?
 I'm guessing Spark is just making use of the Hive metastore to access table
 definitions, and maybe some other things, is that the case?

Under the covers, the Spark SQL thrift server executes queries
using a HiveContext.  In this mode, nearly all computation is done with
Spark SQL, but we try to maintain compatibility with Hive wherever
possible.  This means that you can write your queries in HiveQL, read
tables from the Hive metastore, and use Hive UDFs, UDTs, UDAFs, etc.

The one exception here is Hive DDL operations (CREATE TABLE, etc).  These
are passed directly to Hive code and executed there.  The Spark SQL DDL is
sufficiently different that we always try to parse that first, and fall
back to Hive when it does not parse.

One possibly confusing point here is that you can persist Spark SQL tables
into the Hive metastore, but this is not the same as a Hive table.  We only
use the metastore as a repository for metadata and do not use Hive's format
for the information in this case (as we have data sources that Hive does not
understand, including things like schema auto-discovery).

HiveQL DDL, run by Hive but can be read by Spark SQL: CREATE TABLE t (x
Spark SQL DDL, run by Spark SQL, stored in metastore, cannot be read by
Hive: CREATE TABLE t USING parquet OPTIONS (path '/path/to/data')

 2) Am I therefore right in thinking that SQL queries sent to the thrift
 server are still executed on the Spark cluster, using Spark SQL, and Hive
 plays no active part in computation of results?


 3) What SQL flavour is actually supported by the Thrift Server? Is it Spark
 SQL, Hive, or both? I'm confused, because I've seen it accepting Hive
 CREATE TABLE syntax, but Spark SQL seems to work too?

HiveQL++ (with Spark SQL DDL).  You can make it use our simple SQL parser
by running `SET spark.sql.dialect=sql`, but honestly you probably don't want to do
this.  The included SQL parser is mostly there for people who have
dependency conflicts with Hive.

 4) When I run SQL queries using the Scala or Python shells, Spark seems to
 figure out the schema by itself from my Parquet files very well, if I use
 registerTempTable on the DataFrame. It seems when running the thrift server,
 I need to create a Hive table definition first? Is that the case, or did I
 miss something? If it is, is there some sensible way to automate this?

Temporary tables are only visible to the SQLContext that creates them.  If
you want them to be visible to the server, you need to either start the
thrift server with the same context your program is using
(see HiveThriftServer2.startWithContext) or make a metastore table.  This
can be done using Spark SQL DDL:

CREATE TABLE t USING parquet OPTIONS (path '/path/to/data')
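
For the first option, a minimal sketch against the Spark 1.3 API might look
like the following. The object name, the table name "t" and the default path
are placeholders, not anything from this thread; the server entry point used
is HiveThriftServer2.startWithContext, as in the MyThriftServer example
elsewhere in the thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object TempTableThriftServer {
  def main(args: Array[String]): Unit = {
    // Placeholder location of the Parquet data; pass a real path as the first argument.
    val parquetPath = args.headOption.getOrElse("/path/to/data")

    // The master is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("TempTableThriftServer"))
    val sqlContext = new HiveContext(sc)

    // The schema is read from the Parquet files; no Hive table definition is written.
    val df = sqlContext.parquetFile(parquetPath)
    df.registerTempTable("t")

    // Start the JDBC server on the same context, so the temp table "t" is visible to it.
    HiveThriftServer2.startWithContext(sqlContext)
  }
}
```

The trade-off is that "t" disappears when this application stops, whereas a
table created through the DDL above stays in the metastore.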


Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-07 Thread Michael Armbrust
That should totally work.  The other option would be to run a persistent
metastore that multiple contexts can talk to and periodically run a job
that creates missing tables.  The trade-off here would be more complexity,
but less downtime due to the server restarting.
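
A rough sketch of what the "creates missing tables" part of such a job could
compute, assuming each immediate subdirectory of some root directory holds one
Parquet table (that layout is an assumption, not something stated in this
thread). The names MissingTables, discover and ddlForMissing are hypothetical.
The code only builds the Spark SQL DDL strings, in the OPTIONS form that Spark
1.3's data source DDL expects, and leaves running them to the caller.

```scala
import java.io.File

object MissingTables {
  // Assumed layout: every immediate subdirectory of `root` is one table,
  // named after the directory. Returns table name -> absolute path.
  def discover(root: File): Map[String, String] =
    Option(root.listFiles).getOrElse(Array.empty[File])
      .filter(_.isDirectory)
      .map(d => d.getName -> d.getAbsolutePath)
      .toMap

  // Build one CREATE TABLE statement per discovered table that is not
  // already known. Names and paths are used verbatim, so directory names
  // must be valid table identifiers and paths must not contain quotes.
  def ddlForMissing(discovered: Map[String, String], existing: Set[String]): Seq[String] =
    discovered.toSeq
      .filterNot { case (name, _) => existing.contains(name) }
      .sortBy(_._1)
      .map { case (name, path) =>
        s"CREATE TABLE $name USING parquet OPTIONS (path '$path')"
      }
}
```

With a HiveContext at hand, the existing names could come from
sqlContext.tableNames() and each statement could be run with
ddl.foreach(stmt => sqlContext.sql(stmt)).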

On Tue, Apr 7, 2015 at 12:34 PM, James Aley wrote:

 Hi Michael,

 Thanks so much for the reply - that really cleared a lot of things up for
 me.

 Let me just check that I've interpreted one of your suggestions for (4)
 correctly... Would it make sense for me to write a small wrapper app that
 pulls in hive-thriftserver as a dependency, iterates my Parquet directory
 structure to discover tables and registers each as a temp table in some
 context, before calling HiveThriftServer2.startWithContext as you
 suggest?

 This would mean that to add new content, all I need to do is restart that
 app, which presumably could also be avoided fairly trivially by
 periodically restarting the server with a new context internally. That
 certainly beats manual curation of Hive table definitions, if it will work?

 Thanks again,



Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-07 Thread James Aley
Hi Michael,

Thanks so much for the reply - that really cleared a lot of things up for
me.

Let me just check that I've interpreted one of your suggestions for (4)
correctly... Would it make sense for me to write a small wrapper app that
pulls in hive-thriftserver as a dependency, iterates my Parquet directory
structure to discover tables and registers each as a temp table in some
context, before calling HiveThriftServer2.startWithContext as you suggest?

This would mean that to add new content, all I need to do is restart that app,
which presumably could also be avoided fairly trivially by periodically
restarting the server with a new context internally. That certainly beats
manual curation of Hive table definitions, if it will work?

Thanks again,


