[
https://issues.apache.org/jira/browse/SPARK-15682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308679#comment-15308679
]
Dipankar edited comment on SPARK-15682 at 5/31/16 9:40 PM:
-----------------------------------------------------------
I could make this work by setting the save mode to append (SaveMode.Append):
result_partition.write.format("orc").partitionBy("proc_date").mode("append").save("test.sms_outbound_view_orc")
It looks like, since the partition column value is obtained from the data frame,
there is no way to determine the partition value statically; hence the root
folder is checked for existence. With append, this check is skipped, and the
write appends the contents to a partition if it exists or creates a new one,
as the case may be.
When will we have a handle to the Hive table, instead of the HDFS path, for ORC writes?
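The append-mode behavior described above can be sketched end to end. This is a minimal sketch assuming Spark 1.6 with a live SparkContext `sc` and the `result_tab` temp table registered as in the report; the table path and date value are the ones from this issue:

```scala
import org.apache.spark.sql.{SQLContext, SaveMode}

// Assumes an existing SparkContext `sc` (e.g. from spark-shell) and that
// `result_tab` has already been registered as a temp table.
val sqlContext = new SQLContext(sc)
val currDate = "2016-05-31"

// Add the run date as a constant partition column.
val resultPartition = sqlContext.sql(
  s"SELECT *, '$currDate' AS proc_date FROM result_tab")

// SaveMode.Append skips the root-folder existence check: data lands under
// test.sms_outbound_view_orc/proc_date=<currDate>, creating the partition
// directory if it does not yet exist.
resultPartition.write
  .format("orc")
  .partitionBy("proc_date")
  .mode(SaveMode.Append)
  .save("test.sms_outbound_view_orc")
```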
was (Author: dghosal):
I could make this work by setting the save mode to append (SaveMode.Append):
result_partition.write.format("orc").partitionBy("proc_date").mode("append").save("test.sms_outbound_view_orc")
It looks like, since the partition column value is obtained from the data frame,
there is no way to determine the partition value statically; hence the root
folder is checked for existence. With append, this check is skipped, and the
write appends the contents to a partition if it exists or creates a new one,
as the case may be.
However, it DOES NOT update the Hive metastore with the new partition information!
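Since the file-based append does not register the new partition, one workaround is to add the partition to the metastore explicitly. This is a sketch, assuming the Hive table test.sms_outbound_view_orc already exists in the metastore with proc_date as its partition column and a live SparkContext `sc`:

```scala
import org.apache.spark.sql.hive.HiveContext

// Assumes an existing SparkContext `sc`; the date value is illustrative.
val hiveContext = new HiveContext(sc)
val currDate = "2016-05-31"

// Register the freshly written partition directory with the Hive metastore
// so that Hive and Spark SQL queries against the table can see it.
hiveContext.sql(
  s"ALTER TABLE test.sms_outbound_view_orc " +
  s"ADD IF NOT EXISTS PARTITION (proc_date='$currDate')")
```

Alternatively, running MSCK REPAIR TABLE test.sms_outbound_view_orc should discover all partition directories under the table location in one pass.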
> Hive ORC partition write looks for root hdfs folder for existence
> -----------------------------------------------------------------
>
> Key: SPARK-15682
> URL: https://issues.apache.org/jira/browse/SPARK-15682
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.1
> Reporter: Dipankar
>
> Scenario:
> I am using the program below to create a new partition based on the current
> date, which signifies the run date.
> However, it fails, citing that the HDFS folder already exists. It checks the
> root folder, not the new partition value.
> Does the partitionBy clause actually not check the Hive metastore, or the
> folder down to proc_date=<value>? Is it just a way to create folders based
> on the partition key, unrelated to Hive partitioning?
> Alternatively, should I use
> result.write.format("orc").save("test.sms_outbound_view_orc/proc_date=2016-05-30")
> to achieve the result?
> But this will not update the Hive metastore with the new partition details.
> Is Spark ORC support not equivalent to the HCatStorer API?
> My Hive table is built with proc_date as the partition column.
> Source code :
> result.registerTempTable("result_tab")
> val result_partition = sqlContext.sql("FROM result_tab select *,'"+curr_date+"' as proc_date")
> result_partition.write.format("orc").partitionBy("proc_date").save("test.sms_outbound_view_orc")
> Exception
> 16/05/31 15:57:34 INFO ParseDriver: Parsing command: FROM result_tab select
> *,'2016-05-31' as proc_date
> 16/05/31 15:57:34 INFO ParseDriver: Parse Completed
> Exception in thread "main" org.apache.spark.sql.AnalysisException: path hdfs://hdpprod/user/dipankar.ghosal/test.sms_outbound_view_orc already exists.;
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76)
> at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
> at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
> at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
> at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
> at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
> at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
> at SampleApp$.main(SampleApp.scala:31)
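For context, a Hive table like the one described in the report (test.sms_outbound_view_orc, partitioned by proc_date, stored as ORC) would be created with DDL along these lines. This is a sketch issued through HiveContext; every column name other than proc_date is hypothetical:

```scala
import org.apache.spark.sql.hive.HiveContext

// Assumes an existing SparkContext `sc`. The message columns are
// hypothetical; only proc_date as the partition column is from the issue.
val hiveContext = new HiveContext(sc)

hiveContext.sql("""
  CREATE TABLE IF NOT EXISTS test.sms_outbound_view_orc (
    msg_id STRING,
    msg_body STRING
  )
  PARTITIONED BY (proc_date STRING)
  STORED AS ORC
""")
```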
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)