ok guys, finally figured out how to get it running. I have detailed the steps I followed below. Perhaps it's clear to all you folks; to me it was not :-)

Our Hadoop development environment:
- 3-node development Hadoop cluster
- Current version: CDH 5.3.3
- Hive 0.13.1
- Spark 1.2.0 (standalone mode)
- node1 (worker1, master)
- node2 (worker2)
- node3 (worker3)
- Cloudera Manager to manage and update (using parcels)

Steps to get spark-sql running:

On every node (node1, node2, node3 above), copy the Hive config into Spark's conf dir:

    sudo cp -avi /etc/hive/conf/hive-site.xml /etc/spark/conf

Edit the classpath script and add a line so Spark can see the Hive jars:

    sudo vi /opt/cloudera/parcels/CDH/lib/spark/bin/compute-classpath.sh

    # added by sanjay for running Spark using hive metadata
    CLASSPATH="$CLASSPATH:/opt/cloudera/parcels/CDH/lib/hive/lib/*"

Run Spark SQL in CLI mode:

    /opt/cloudera/parcels/CDH/lib/spark/bin/spark-sql

Run Spark SQL in async (single-query) mode:

    /opt/cloudera/parcels/CDH/lib/spark/bin/spark-sql -e "select * from band.beatles where upper(first_name) like '%GEORGE%'"

Run Spark SQL in "SQL file" mode:

    /opt/cloudera/parcels/CDH/lib/spark/bin/spark-sql -f get_names.hql

From: Andrew Otto <ao...@wikimedia.org>
To: Sanjay Subramanian <sanjaysubraman...@yahoo.com>
Cc: user <user@spark.apache.org>
Sent: Thursday, May 28, 2015 7:26 AM
Subject: Re: Pointing SparkSQL to existing Hive Metadata with data file locations in HDFS

    val sqlContext = new HiveContext(sc)
    val schemaRdd = sqlContext.sql("some complex SQL")

It mostly works, but we have been having issues with tables that contain a large amount of data:
https://issues.apache.org/jira/browse/SPARK-6910

On May 27, 2015, at 20:52, Sanjay Subramanian <sanjaysubraman...@yahoo.com.INVALID> wrote:

hey guys

On the Hive/Hadoop ecosystem we are running (Cloudera distribution CDH 5.2.x), there are about 300+ Hive tables. The data is stored as text (moving slowly to Parquet) on HDFS. I want to use SparkSQL, point it at the Hive metadata, and be able to define JOINs etc. using a programming structure like this:

    import org.apache.spark.sql.hive.HiveContext
    val sqlContext = new HiveContext(sc)
    val schemaRdd = sqlContext.sql("some complex SQL")

Is that the way to go? Some guidance will be great.

thanks

sanjay
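P.S. The manual compute-classpath.sh edit in the steps above has to be repeated on every node, and it is easy to fat-finger or apply twice. Below is a minimal sketch of scripting it idempotently. The SPARK_BIN override and the scratch-directory default are my additions so it can be dry-run anywhere; on a real cluster node, set SPARK_BIN to /opt/cloudera/parcels/CDH/lib/spark/bin and run it as root.

```shell
# Idempotent sketch of the compute-classpath.sh edit from the steps above.
# SPARK_BIN defaults to a scratch directory so this can be dry-run anywhere;
# on a cluster node, set SPARK_BIN=/opt/cloudera/parcels/CDH/lib/spark/bin
# (and run as root, since the parcel directory is not user-writable).
SPARK_BIN="${SPARK_BIN:-$(mktemp -d)}"
SCRIPT="$SPARK_BIN/compute-classpath.sh"
touch "$SCRIPT"   # no-op on the real script; creates the demo file otherwise

# Exact line from the steps above; single quotes keep $CLASSPATH literal.
LINE='CLASSPATH="$CLASSPATH:/opt/cloudera/parcels/CDH/lib/hive/lib/*"'

# Append only if the exact line is not already present, so reruns are safe.
if ! grep -qxF "$LINE" "$SCRIPT"; then
    printf '%s\n%s\n' '# added by sanjay for running Spark using hive metadata' "$LINE" >> "$SCRIPT"
fi
```

Running it a second time appends nothing, so it is safe to push out with whatever loop or tool you use to reach node1..node3.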