[SparkSQL] Reuse HiveContext to different Hive warehouse?

2015-03-10 Thread Haopu Wang
I'm using a Spark 1.3.0 RC3 build with Hive support.

 

In the Spark shell, I want to reuse the same HiveContext instance with
different warehouse locations. Below are the steps for my test (assume I
have already loaded a file into table src).
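
For reference, src is the usual Hive example table; it can be created with
something like the standard example from the Spark SQL guide (the schema and
sample file below are the typical example, not necessarily exactly what I used):

==

scala> // create the classic key/value example table and load the sample file that ships with Spark
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
scala> sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

==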

 

==

15/03/10 18:22:59 INFO SparkILoop: Created sql context (with Hive support)..

SQL context available as sqlContext.

scala> sqlContext.sql("SET hive.metastore.warehouse.dir=/test/w")

scala> sqlContext.sql("SELECT * from src").saveAsTable("table1")

scala> sqlContext.sql("SET hive.metastore.warehouse.dir=/test/w2")

scala> sqlContext.sql("SELECT * from src").saveAsTable("table2")

==

After these steps, both tables are stored under /test/w only. I expected
table2 to be stored under the /test/w2 folder.
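
One workaround that may sidestep this (just a sketch, not verified against
1.3.0 RC3) is to write each result to an explicit path instead of relying on
the warehouse directory:

==

scala> val df = sqlContext.sql("SELECT * from src")
scala> df.saveAsParquetFile("/test/w2/table2")                                // explicit output folder (hypothetical path)
scala> sqlContext.parquetFile("/test/w2/table2").registerTempTable("table2")  // read it back and query it later

==

That avoids hive.metastore.warehouse.dir entirely, but it does not register a
permanent metastore table, so I would still like to understand the
warehouse.dir behavior.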

 

Another question: if I set hive.metastore.warehouse.dir to an HDFS folder,
I cannot use saveAsTable(). Is this by design? The exception stack trace is
below:

==

15/03/10 18:35:28 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0

15/03/10 18:35:28 INFO SparkContext: Created broadcast 0 from broadcast at TableReader.scala:74

java.lang.IllegalArgumentException: Wrong FS: hdfs://server:8020/space/warehouse/table2, expected: file:///
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)
        at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:463)
        at org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:118)
        at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:252)
        at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:251)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:251)
        at org.apache.spark.sql.parquet.ParquetRelation2.<init>(newParquet.scala:370)
        at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:96)
        at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:125)
        at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:308)
        at org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:217)
        at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:55)
        at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:55)
        at org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:65)
        at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1088)
        at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1088)
        at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1048)
        at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:998)
        at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:964)
        at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:942)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:20)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:25)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:27)
        at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
        at $iwC$$iwC$$iwC.<init>(<console>:33)
        at $iwC$$iwC.<init>(<console>:35)
        at $iwC.<init>(<console>:37)
        at <init>(<console>:39)

==
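
Note that the "expected: file:///" part of the message suggests the table path
is being qualified against the local filesystem rather than HDFS. A quick way
to check which default filesystem the shell is actually using (a sketch, using
the sc provided by spark-shell):

==

scala> import org.apache.hadoop.fs.FileSystem
scala> sc.hadoopConfiguration.get("fs.defaultFS")        // or the older key fs.default.name
scala> FileSystem.get(sc.hadoopConfiguration).getUri     // should be hdfs://server:8020 if HDFS is really the default FS

==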

 

Thank you very much!

 



RE: [SparkSQL] Reuse HiveContext to different Hive warehouse?

2015-03-10 Thread Cheng, Hao
I am not sure whether Hive supports changing the metastore warehouse location 
after it has been initialized; I guess not. Spark SQL relies entirely on the Hive 
Metastore in HiveContext, which is probably why Q1 does not work as expected.
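
If you really need a different warehouse location at runtime, one thing to try 
(just a sketch; I have not verified that an already-initialized metastore will 
honor it) is setting the property on a fresh HiveContext before it touches the 
metastore:

==

scala> import org.apache.spark.sql.hive.HiveContext
scala> val hc2 = new HiveContext(sc)
scala> hc2.setConf("hive.metastore.warehouse.dir", "/test/w2")   // may be ignored once the metastore is initialized
scala> hc2.sql("SELECT * from src").saveAsTable("table2")

==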

BTW, in most cases people configure the metastore settings in hive-site.xml and 
never change them afterwards. Is there any reason you want to change it at 
runtime?

For Q2, there is probably something wrong in the configuration; it seems HDFS 
has ended up in pseudo/single-node mode. Can you double-check that? Also, can 
you run DDL (like creating a table) from the Spark shell with HiveContext?
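
For example (a sketch only, the table name is made up), something like this from 
spark-shell would show whether plain Hive DDL ends up on the filesystem you expect:

==

scala> sqlContext.sql("CREATE TABLE ddl_test (key INT, value STRING)")
scala> sqlContext.sql("DESCRIBE FORMATTED ddl_test").collect().foreach(println)  // the Location line should start with hdfs://, not file:/

==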

From: Haopu Wang [mailto:hw...@qilinsoft.com]
Sent: Tuesday, March 10, 2015 6:38 PM
To: user; dev@spark.apache.org
Subject: [SparkSQL] Reuse HiveContext to different Hive warehouse?

