[ https://issues.apache.org/jira/browse/SPARK-43149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bruce Robbins updated SPARK-43149:
----------------------------------
    Summary: When CTAS with USING fails to store metadata in metastore, data gets left around  (was: When CREATE USING fails to store metadata in metastore, data gets left around)

> When CTAS with USING fails to store metadata in metastore, data gets left
> around
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-43149
>                 URL: https://issues.apache.org/jira/browse/SPARK-43149
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Bruce Robbins
>            Priority: Major
>
> For example:
> {noformat}
> drop table if exists parquet_ds1;
>
> -- try creating table with invalid column name
> -- use 'using parquet' to designate the data source
> create table parquet_ds1 using parquet as
> select id, date'2018-01-01' + make_dt_interval(0, id)
> from range(0, 10);
>
> Cannot create a table having a column whose name contains commas in Hive
> metastore. Table: `spark_catalog`.`default`.`parquet_ds1`; Column: DATE
> '2018-01-01' + make_dt_interval(0, id, 0, 0.000000)
>
> -- show that table did not get created
> show tables;
>
> -- try again with valid column name
> -- spark will complain that directory already exists
> create table parquet_ds1 using parquet as
> select id, date'2018-01-01' + make_dt_interval(0, id) as ts
> from range(0, 10);
>
> [LOCATION_ALREADY_EXISTS] Cannot name the managed table as
> `spark_catalog`.`default`.`parquet_ds1`, as its associated location
> 'file:/Users/bruce/github/spark_upstream/spark-warehouse/parquet_ds1' already
> exists. Please pick a different table name, or remove the existing location
> first.
>
> org.apache.spark.SparkRuntimeException: [LOCATION_ALREADY_EXISTS] Cannot name
> the managed table as `spark_catalog`.`default`.`parquet_ds1`, as its
> associated location
> 'file:/Users/bruce/github/spark_upstream/spark-warehouse/parquet_ds1' already
> exists. Please pick a different table name, or remove the existing location
> first.
> 	at org.apache.spark.sql.errors.QueryExecutionErrors$.locationAlreadyExists(QueryExecutionErrors.scala:2804)
> 	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.validateTableLocation(SessionCatalog.scala:414)
> 	at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176)
> 	...
> {noformat}
> One must manually remove the directory {{spark-warehouse/parquet_ds1}} before
> the {{create table}} command will succeed.
> It seems that datasource table creation runs the data-creation job first,
> then stores the metadata in the metastore.
> The issue does not happen when using Spark to create Hive tables:
> {noformat}
> drop table if exists parquet_hive1;
>
> -- try creating table with invalid column name,
> -- but use 'stored as parquet' instead of 'using'
> create table parquet_hive1 stored as parquet as
> select id, date'2018-01-01' + make_dt_interval(0, id)
> from range(0, 10);
>
> Cannot create a table having a column whose name contains commas in Hive
> metastore. Table: `spark_catalog`.`default`.`parquet_hive1`; Column: DATE
> '2018-01-01' + make_dt_interval(0, id, 0, 0.000000)
>
> -- try again with valid column name. This will succeed.
> create table parquet_hive1 stored as parquet as
> select id, date'2018-01-01' + make_dt_interval(0, id) as ts
> from range(0, 10);
> {noformat}
> It seems that Hive table creation stores the metadata in the metastore first,
> then runs the data-creation job.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
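The ordering difference the report describes can be illustrated with a minimal, purely hypothetical Python sketch (none of these names are Spark or Hive APIs): a data-first sequence orphans its output directory when the metadata step fails, while a metadata-first sequence fails before anything is written. The manual `rmtree` cleanup in the middle mirrors the directory removal the report says is required.

```python
import shutil
import tempfile
from pathlib import Path

class MetastoreError(Exception):
    pass

def store_metadata(table):
    """Stand-in for the metastore call; rejects a column name with commas."""
    if "," in table["column"]:
        raise MetastoreError("column name contains commas")

def write_data(warehouse, table):
    """Stand-in for the data-creation job: materializes the table directory."""
    loc = warehouse / table["name"]
    loc.mkdir()
    (loc / "part-00000.parquet").touch()

def ctas_using(warehouse, table):
    # datasource path: data first, then metadata
    write_data(warehouse, table)
    store_metadata(table)        # raises -> directory is already orphaned

def ctas_stored_as(warehouse, table):
    # Hive path: metadata first, then data
    store_metadata(table)        # raises -> nothing has been written yet
    write_data(warehouse, table)

bad = {"name": "parquet_ds1",
       "column": "DATE '2018-01-01' + make_dt_interval(0, id, 0, 0.000000)"}

with tempfile.TemporaryDirectory() as d:
    warehouse = Path(d)
    try:
        ctas_using(warehouse, bad)
    except MetastoreError:
        pass
    orphaned = (warehouse / "parquet_ds1").exists()   # data left around

    shutil.rmtree(warehouse / "parquet_ds1")          # the manual cleanup step
    try:
        ctas_stored_as(warehouse, bad)
    except MetastoreError:
        pass
    leftover = (warehouse / "parquet_ds1").exists()   # nothing was written

print(orphaned, leftover)   # True False
```

The sketch only models the sequencing, not Spark's actual commit protocol; the point is that whichever step runs second gets a "free" rollback when the first step fails, which is why the `stored as parquet` path fails cleanly.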