[ 
https://issues.apache.org/jira/browse/SPARK-29299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhijeet updated SPARK-29299:
-----------------------------
    Summary: Intermittently getting "Cannot create the managed table error" 
while creating table from spark 2.4  (was: Intermittently getting "Can not 
create the managed table error" while creating table from spark 2.4)

> Intermittently getting "Cannot create the managed table error" while creating 
> table from spark 2.4
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-29299
>                 URL: https://issues.apache.org/jira/browse/SPARK-29299
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Abhijeet
>            Priority: Major
>
> We are intermittently facing the error below in Spark 2.4 when saving a 
> managed table from Spark.
> Error -
> pyspark.sql.utils.AnalysisException: u"Can not create the managed 
> table('`hive_issue`.`table`'). The associated 
> location('s3://{bucket_name}/EMRFS_WARE_TEST167_new/warehouse/hive_issue.db/table')
>  already exists.;"
> Steps to reproduce--
> 1. Create a dataframe from mid-size data (a 30 MB CSV file)
> 2. Save the dataframe as a table
> 3. Terminate the session while the above operation is in progress
> Note--
> Session termination is just a way to reproduce this issue. In practice we 
> face it intermittently when running the same Spark jobs multiple times. We 
> use EMRFS and HDFS on an EMR cluster and see the same issue on both systems.
> The only way we can fix this is by deleting the target folder where the 
> table keeps its files, which is not an option for us: we need to keep 
> historical information in the table, hence we use APPEND mode while writing 
> to the table.
> Sample code--
> from pyspark.sql import SparkSession
> sc = SparkSession.builder.enableHiveSupport().getOrCreate()
> df = sc.read.csv("s3://{sample-bucket}1/DATA/consumecomplians.csv")
> print "STARTED WRITING TO TABLE"
> # Terminate the session with Ctrl+C after the df.write action below has started
> df.write.mode("append").saveAsTable("hive_issue.table")
> print "COMPLETED WRITING TO TABLE"
> We went through the documentation of Spark 2.4 [1] and found that Spark no 
> longer allows creating managed tables in a non-empty location.
> 1. What is the reason behind this change in Spark's behaviour?
> 2. To us this looks like a breaking change: despite specifying the 
> "overwrite" option, Spark is unable to wipe out the existing data and create 
> the table.
> 3. Is there any solution to this issue other than setting the 
> "spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation" flag?
> [1]
> https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html
>  
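A sketch of the flag workaround mentioned in question (3) above, assuming a Spark 2.4+ runtime with Hive support (this is a configuration sketch, not runnable outside a Spark installation; the S3 path is a placeholder carried over from the report):

```python
from pyspark.sql import SparkSession

# Sketch only: restores the pre-2.4 behaviour of allowing a managed table to
# be created over a non-empty location. Note the flag papers over leftover
# files from a failed write rather than cleaning them up.
spark = (SparkSession.builder
         .enableHiveSupport()
         .config("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation",
                 "true")
         .getOrCreate())

# Placeholder path from the report; replace with the real bucket/key.
df = spark.read.csv("s3://{sample-bucket}1/DATA/consumecomplians.csv")
df.write.mode("append").saveAsTable("hive_issue.table")
```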



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
