Dipankar created SPARK-15682:
--------------------------------

             Summary: Hive ORC partition write looks for root hdfs folder for existence
                 Key: SPARK-15682
                 URL: https://issues.apache.org/jira/browse/SPARK-15682
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.6.1
            Reporter: Dipankar


Scenario:
I am using the program below to create a new partition based on the current date, which signifies the run date.

However, the write fails, reporting that the HDFS folder already exists. The existence check is made against the root table folder, not the new partition value.

Does the partitionBy clause actually not check the Hive metastore, or the folder path down to proc_date=<value>? Is it just a way to create folders based on the partition key, with no relation to Hive partitions?
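
If the problem is only the existence check, one workaround might be to switch from the default ErrorIfExists save mode to append, which should skip the root-path check and only add new partition folders. A minimal sketch (using the result_partition DataFrame from the source code below; not verified on my side):

import org.apache.spark.sql.SaveMode

// Append instead of the default ErrorIfExists, so an existing root folder is not treated as an error.
result_partition.write
  .format("orc")
  .mode(SaveMode.Append)
  .partitionBy("proc_date")
  .save("test.sms_outbound_view_orc")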

Alternatively, should I use
result.write.format("orc").save("test.sms_outbound_view_orc/proc_date=2016-05-30")
to achieve the result?

But this will not update the Hive metastore with the new partition details.
Is Spark's ORC support not equivalent to the HCatStorer API?
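
If I write into the partition folder directly, the partition would presumably have to be registered by hand. A rough sketch (assuming sqlContext is a HiveContext and the table already exists in the metastore):

// Write the files into the partition folder, then tell the metastore about the new partition.
result.write.format("orc")
  .save("test.sms_outbound_view_orc/proc_date=2016-05-30")
// If the folder is not under the table's default location, a LOCATION clause would be needed as well.
sqlContext.sql(
  "ALTER TABLE test.sms_outbound_view_orc " +
  "ADD IF NOT EXISTS PARTITION (proc_date='2016-05-30')")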

My Hive table is built with proc_date as the partition column.
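
For reference, the table definition is assumed to look roughly like this (the column list is hypothetical; only the proc_date partition column and ORC storage are stated above):

// Hypothetical DDL sketch; msg_text stands in for the real columns, which are not shown in this report.
sqlContext.sql(
  """CREATE TABLE IF NOT EXISTS test.sms_outbound_view_orc (
    |  msg_text STRING
    |) PARTITIONED BY (proc_date STRING)
    |STORED AS ORC""".stripMargin)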



Source code:
// Register the DataFrame as a temp table, then add proc_date (the run date) as a literal column.
result.registerTempTable("result_tab")
val result_partition = sqlContext.sql("FROM result_tab select *,'" + curr_date + "' as proc_date")
// Write ORC output partitioned by proc_date; this fails because the default save mode is ErrorIfExists.
result_partition.write.format("orc").partitionBy("proc_date").save("test.sms_outbound_view_orc")


Exception:
16/05/31 15:57:34 INFO ParseDriver: Parsing command: FROM result_tab select *,'2016-05-31' as proc_date
16/05/31 15:57:34 INFO ParseDriver: Parse Completed
Exception in thread "main" org.apache.spark.sql.AnalysisException: path hdfs://hdpprod/user/dipankar.ghosal/test.sms_outbound_view_orc already exists.;
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76)
        at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
        at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
        at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
        at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
        at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
        at SampleApp$.main(SampleApp.scala:31)



