To use HiveThriftServer2.startWithContext, I thought one would use the following artifact in the build:
"org.apache.spark" %% "spark-hive-thriftserver" % "1.3.0" But I am unable to resolve the artifact. I do not see it in maven central or any other repo. Do I need to build Spark and publish locally or just missing something obvious here? Basic class is like this: import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.hive.HiveContext import org.apache.spark.sql.hive.HiveMetastoreTypes._ import org.apache.spark.sql.types._ import org.apache.spark.sql.hive.thriftserver._ object MyThriftServer { val sparkConf = new SparkConf() // master is passed to spark-submit, but could also be specified explicitely // .setMaster(sparkMaster) .setAppName("My ThriftServer") .set("spark.cores.max", "2") val sc = new SparkContext(sparkConf) val sparkContext = sc import sparkContext._ val sqlContext = new HiveContext(sparkContext) import sqlContext._ import sqlContext.implicits._ // register temp tables here HiveThriftServer2.startWithContext(sqlContext) } Build has the following: scalaVersion := "2.10.4" val SPARK_VERSION = "1.3.0" libraryDependencies ++= Seq( "org.apache.spark" %% "spark-streaming-kafka" % SPARK_VERSION exclude("org.apache.spark", "spark-core_2.10") exclude("org.apache.spark", "spark-streaming_2.10") exclude("org.apache.spark", "spark-sql_2.10") exclude("javax.jms", "jms"), "org.apache.spark" %% "spark-core" % SPARK_VERSION % "provided", "org.apache.spark" %% "spark-streaming" % SPARK_VERSION % "provided", "org.apache.spark" %% "spark-sql" % SPARK_VERSION % "provided", "org.apache.spark" %% "spark-hive" % SPARK_VERSION % "provided", "org.apache.spark" %% "spark-hive-thriftserver" % SPARK_VERSION % "provided", "org.apache.kafka" %% "kafka" % "0.8.1.1" exclude("javax.jms", "jms") exclude("com.sun.jdmk", "jmxtools") exclude("com.sun.jmx", "jmxri"), "joda-time" % "joda-time" % "2.7", "log4j" % "log4j" % "1.2.14" exclude("com.sun.jdmk", "jmxtools") exclude("com.sun.jmx", "jmxri") ) Appreciate the assistance. -Todd On Tue, Apr 7, 2015 at 4:09 PM, James Aley <james.a...@swiftkey.com> wrote: > Excellent, thanks for your help, I appreciate your advice! > On 7 Apr 2015 20:43, "Michael Armbrust" <mich...@databricks.com> wrote: > >> That should totally work. The other option would be to run a persistent >> metastore that multiple contexts can talk to and periodically run a job >> that creates missing tables. The trade-off here would be more complexity, >> but less downtime due to the server restarting. >> >> On Tue, Apr 7, 2015 at 12:34 PM, James Aley <james.a...@swiftkey.com> >> wrote: >> >>> Hi Michael, >>> >>> Thanks so much for the reply - that really cleared a lot of things up >>> for me! >>> >>> Let me just check that I've interpreted one of your suggestions for (4) >>> correctly... Would it make sense for me to write a small wrapper app that >>> pulls in hive-thriftserver as a dependency, iterates my Parquet >>> directory structure to discover "tables" and registers each as a temp table >>> in some context, before calling HiveThriftServer2.createWithContext as >>> you suggest? >>> >>> This would mean that to add new content, all I need to is restart that >>> app, which presumably could also be avoided fairly trivially by >>> periodically restarting the server with a new context internally. That >>> certainly beats manual curation of Hive table definitions, if it will work? >>> >>> >>> Thanks again, >>> >>> James. 
>>>
>>> On 7 April 2015 at 19:30, Michael Armbrust <mich...@databricks.com>
>>> wrote:
>>>
>>>>> 1) What exactly is the relationship between the thrift server and
>>>>> Hive? I'm guessing Spark is just making use of the Hive metastore to
>>>>> access table definitions, and maybe some other things, is that the
>>>>> case?
>>>>
>>>> Underneath the covers, the Spark SQL thrift server executes queries
>>>> using a HiveContext. In this mode, nearly all computation is done with
>>>> Spark SQL, but we try to maintain compatibility with Hive wherever
>>>> possible. This means that you can write your queries in HiveQL, read
>>>> tables from the Hive metastore, and use Hive UDFs, UDAFs, UDTFs, etc.
>>>>
>>>> The one exception here is Hive DDL operations (CREATE TABLE, etc.).
>>>> These are passed directly to Hive code and executed there. The Spark
>>>> SQL DDL is sufficiently different that we always try to parse that
>>>> first, and fall back to Hive when it does not parse.
>>>>
>>>> One possibly confusing point here is that you can persist Spark SQL
>>>> tables into the Hive metastore, but such a table is not the same as a
>>>> Hive table. We only use the metastore as a repository for metadata; we
>>>> do not use the Hive format for the information in this case (as we
>>>> have data sources that Hive does not understand, including things like
>>>> schema auto-discovery).
>>>>
>>>> HiveQL DDL, run by Hive, but readable by Spark SQL:
>>>>
>>>>   CREATE TABLE t (x INT) STORED AS PARQUET
>>>>
>>>> Spark SQL DDL, run by Spark SQL and stored in the metastore, but not
>>>> readable by Hive:
>>>>
>>>>   CREATE TABLE t USING parquet OPTIONS (path '/path/to/data')
>>>>
>>>>> 2) Am I therefore right in thinking that SQL queries sent to the
>>>>> thrift server are still executed on the Spark cluster, using Spark
>>>>> SQL, and Hive plays no active part in computation of results?
>>>>
>>>> Correct.
>>>>
>>>>> 3) What SQL flavour is actually supported by the Thrift Server? Is it
>>>>> Spark SQL, Hive, or both? I'm confused, because I've seen it
>>>>> accepting Hive CREATE TABLE syntax, but Spark SQL seems to work too?
>>>>
>>>> HiveQL++ (that is, HiveQL plus the Spark SQL DDL). You can make it use
>>>> our simple SQL parser by `SET spark.sql.dialect=sql`, but honestly you
>>>> probably don't want to do this. The included SQL parser is mostly
>>>> there for people who have dependency conflicts with Hive.
>>>>
>>>>> 4) When I run SQL queries using the Scala or Python shells, Spark
>>>>> seems to figure out the schema by itself from my Parquet files very
>>>>> well, if I use registerTempTable on the DataFrame. It seems when
>>>>> running the thrift server, I need to create a Hive table definition
>>>>> first? Is that the case, or did I miss something? If it is, is there
>>>>> some sensible way to automate this?
>>>>
>>>> Temporary tables are only visible to the SQLContext that creates them.
>>>> If you want a table to be visible to the server, you need to either
>>>> start the thrift server with the same context your program is using
>>>> (see HiveThriftServer2.startWithContext) or make a metastore table.
>>>> The latter can be done using the Spark SQL DDL:
>>>>
>>>>   CREATE TABLE t USING parquet OPTIONS (path '/path/to/data')
>>>>
>>>> Michael
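One more thought: the metastore-table route Michael describes at the end
looks easy to drive from my skeleton above. I'd guess it comes down to
something like this (untested on my end; the table name and path are
placeholders):

// Persist a data source table definition in the Hive metastore, so that
// a separately started thrift server can see it too. Only metadata goes
// into the metastore; the data stays in the Parquet files at 'path'.
sqlContext.sql(
  "CREATE TABLE events USING parquet OPTIONS (path '/path/to/data')")

Unlike a temp table registered on the context, that definition should
outlive the app that created it, if I'm reading Michael's explanation
correctly.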