Hi,

We would like to use Spark SQL to store data in Parquet format and then
query that data using Impala.

We've come up with a solution that works, but it doesn't seem like the
right approach, so I was wondering if you could tell us the correct way
to do this.  We are using Spark 1.0 and Impala 1.3.1.

First we create the Parquet file using Spark SQL:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// create an empty Parquet file laid out with ParqTable's schema
sqlContext.createParquetFile[ParqTable](
  "hdfs://localhost:8020/user/hive/warehouse/ParqTable.pqt", allowExisting = true)
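
Here ParqTable is just a Scala case class whose fields define the Parquet
schema; a minimal sketch (the three fields are placeholders, not our actual
event schema):

case class ParqTable(id: Long, name: String, value: Double)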

Then we are using the HiveContext to register the table and do the insert:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._
// register the Parquet file as a table, then append each streaming batch to it
hiveContext.parquetFile("hdfs://localhost:8020/user/hive/warehouse/ParqTable.pqt")
  .registerAsTable("ParqTable")
eventsDStream.foreachRDD(rdd => rdd.insertInto("ParqTable"))
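
Note that the last line only compiles because import hiveContext._ brings the
implicit createSchemaRDD conversion into scope: assuming eventsDStream is a
DStream of ParqTable case classes, each RDD handed to foreachRDD is converted
to a SchemaRDD, which is what provides insertInto.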

Now we have the data stored in a Parquet file.  To access it in Hive or
Impala we then expose that location as an external table, sketched below.
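
A minimal sketch of that DDL, reusing the hypothetical three-column schema
above (run it in impala-shell, followed by INVALIDATE METADATA so Impala
picks up the new table; Impala 1.3 understands STORED AS PARQUET, while
older Hive versions may need the explicit Parquet SerDe instead):

CREATE EXTERNAL TABLE ParqTable (id BIGINT, name STRING, value DOUBLE)
STORED AS PARQUET
LOCATION 'hdfs://localhost:8020/user/hive/warehouse/ParqTable.pqt';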
