[
https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15265591#comment-15265591
]
Xin Wu commented on SPARK-14927:
--------------------------------
Since Spark 2.0.0 has moved a lot of code around, including splitting
HiveMetaStoreCatalog into two files (one for resolving tables and one for
creating them), I tried this on Spark 2.0.0:
{code}
scala> spark.sql("create database if not exists tmp")
16/04/30 19:59:12 WARN ObjectStore: Failed to get database tmp, returning NoSuchObjectException
res23: org.apache.spark.sql.DataFrame = []

scala> df.write.partitionBy("year").mode(SaveMode.Append).saveAsTable("tmp.tmp1")
16/04/30 19:59:50 WARN CreateDataSourceTableUtils: Persisting partitioned data source relation `tmp`.`tmp1` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. Input path(s): file:/home/xwu0226/spark/spark-warehouse/tmp.db/tmp1

scala> spark.sql("select * from tmp.tmp1").show
+---+----+
|val|year|
+---+----+
| a|2012|
+---+----+
{code}
For a datasource table created like this, Spark SQL creates a Hive internal
(managed) table, but in a format that is not compatible with Hive. Spark SQL puts
the partition column information (along with other things such as the column
schema and the bucket/sort columns) into serdeInfo.parameters. When the table is
queried, Spark SQL resolves it and parses that information back out of
serdeInfo.parameters.
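A quick way to check what actually lands in the metastore (just a sketch, assuming the same spark-shell session as above) is to describe the table and look at the detailed table information:
{code}
// The "# Detailed Table Information" section should list the Spark SQL specific
// properties stored for tmp.tmp1 (provider, serialized schema, partition columns).
spark.sql("DESCRIBE EXTENDED tmp.tmp1").show(100, truncate = false)
{code}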
Spark 2.0.0 no longer passes this command to Hive (most DDL commands are now run
natively in Spark SQL), and SHOW PARTITIONS currently does not support datasource
tables:
{code}
scala> spark.sql("show partitions tmp.tmp1").show
org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is not allowed on a datasource table: tmp.tmp1;
at org.apache.spark.sql.execution.command.ShowPartitionsCommand.run(commands.scala:196)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:62)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:60)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:113)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:113)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:132)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:129)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:112)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:529)
... 48 elided
{code}
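Until SHOW PARTITIONS supports datasource tables, the partition values can still be read back from the table itself; a minimal sketch, assuming the same session as above:
{code}
// Workaround sketch: select the partition column directly,
// since SHOW PARTITIONS is rejected for this table.
spark.sql("SELECT DISTINCT year FROM tmp.tmp1").show()
{code}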
Hope this helps.
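For the original 1.6.1 report quoted below, the workaround that is usually suggested (a sketch only, not verified here; the table and column names are taken from the report) is to create the partitioned Hive table up front and append into it with dynamic partitioning, rather than letting saveAsTable create it:
{code}
// Sketch, assuming the HiveContext `hc` and the implicits/imports
// from the reporter's snippet (toDF, SaveMode).
hc.setConf("hive.exec.dynamic.partition", "true")
hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
hc.sql("create table if not exists tmp.partitiontest1 (val string) partitioned by (year int) stored as parquet")
// insertInto matches columns by position, so put the partition column last.
Seq(2012 -> "a").toDF("year", "val")
  .select("val", "year")
  .write
  .mode(SaveMode.Append)
  .insertInto("tmp.partitiontest1")
hc.sql("show partitions tmp.partitiontest1").show
{code}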
> DataFrame. saveAsTable creates RDD partitions but not Hive partitions
> ---------------------------------------------------------------------
>
> Key: SPARK-14927
> URL: https://issues.apache.org/jira/browse/SPARK-14927
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.2, 1.6.1
> Environment: Mac OS X 10.11.4 local
> Reporter: Sasha Ovsankin
>
> This is a followup to
> http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive
> . I tried to use the suggestions in the answers but couldn't make it work in
> Spark 1.6.1.
> I am trying to create partitions programmatically from a `DataFrame`. Here is
> the relevant code (adapted from a Spark test):
> hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
> // hc.setConf("hive.exec.dynamic.partition", "true")
> // hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
> hc.sql("create database if not exists tmp")
> hc.sql("drop table if exists tmp.partitiontest1")
> Seq(2012 -> "a").toDF("year", "val")
> .write
> .partitionBy("year")
> .mode(SaveMode.Append)
> .saveAsTable("tmp.partitiontest1")
> hc.sql("show partitions tmp.partitiontest1").show
> Full file is here:
> https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a
> I get the error that the table is not partitioned:
> ======================
> HIVE FAILURE OUTPUT
> ======================
> SET hive.support.sql11.reserved.keywords=false
> SET hive.metastore.warehouse.dir=tmp/tests
> OK
> OK
> FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a
> partitioned table
> ======================
> It looks like the root cause is that
> `org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable`
> always creates table with empty partitions.
> Any help to move this forward is appreciated.