[
https://issues.apache.org/jira/browse/SPARK-29299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Abhijeet updated SPARK-29299:
-----------------------------
Summary: Intermittently getting "Cannot create the managed table error"
while creating table from spark 2.4 (was: Intermittently getting "Can not
create the managed table error" while creating table from spark 2.4)
> Intermittently getting "Cannot create the managed table error" while creating
> table from spark 2.4
> --------------------------------------------------------------------------------------------------
>
> Key: SPARK-29299
> URL: https://issues.apache.org/jira/browse/SPARK-29299
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: Abhijeet
> Priority: Major
>
> We are intermittently facing the below error in Spark 2.4 when saving a managed
> table from Spark.
> Error -
> pyspark.sql.utils.AnalysisException: u"Can not create the managed
> table('`hive_issue`.`table`'). The associated
> location('s3://{bucket_name}/EMRFS_WARE_TEST167_new/warehouse/hive_issue.db/table')
> already exists.;"
> Steps to reproduce--
> 1. Create a DataFrame from mid-size data (a 30 MB CSV file)
> 2. Save the DataFrame as a table
> 3. Terminate the session while the above operation is in progress
> Note--
> Session termination is just a way to reproduce this issue. In practice we face
> it intermittently when running the same Spark jobs multiple times. We use both
> EMRFS and HDFS on an EMR cluster and see the same issue on both systems.
> The only way we can fix this is by deleting the target folder where the table
> keeps its files, which is not an option for us: we need to keep historical
> information in the table, hence we use APPEND mode while writing to the table.
> Sample code--
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
> df = spark.read.csv("s3://{sample-bucket}1/DATA/consumecomplians.csv")
> print("STARTED WRITING TO TABLE")
> # Terminate the session with Ctrl+C after the df.write action below has started
> df.write.mode("append").saveAsTable("hive_issue.table")
> print("COMPLETED WRITING TO TABLE")
> We went through the documentation of Spark 2.4 [1] and found that Spark no
> longer allows creating managed tables in a non-empty location.
> 1. What is the reason behind this change in Spark's behaviour?
> 2. To us this looks like a breaking change: despite specifying the "overwrite"
> option, Spark is unable to wipe out the existing data and create the table.
> 3. Is there any solution for this issue other than setting the
> "spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation" flag?
> [1]
> https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html
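> As an interim workaround, the legacy flag mentioned in point 3 can be set when
> building the session. A minimal sketch (the bucket, path, and table names are
> placeholders taken from this report, not verified values; this requires a
> working Spark/Hive environment):

```python
from pyspark.sql import SparkSession

# Sketch: re-enables the pre-2.4 behaviour of allowing a managed table to be
# created over a non-empty location. The flag must be set before the session
# (and its SQL context) is created for it to take effect reliably.
spark = (
    SparkSession.builder
    .config("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation",
            "true")
    .enableHiveSupport()
    .getOrCreate()
)

df = spark.read.csv("s3://{sample-bucket}1/DATA/consumecomplians.csv")
df.write.mode("append").saveAsTable("hive_issue.table")
```

> Note this only suppresses the check; it does not clean up the leftover files
> from the interrupted write, so stale data at the table location may still be
> picked up on read.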
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]