[ https://issues.apache.org/jira/browse/SPARK-43149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruce Robbins updated SPARK-43149:
----------------------------------
    Summary: When CTAS with USING fails to store metadata in metastore, data 
gets left around  (was: When CREATE USING fails to store metadata in metastore, 
data gets left around)

> When CTAS with USING fails to store metadata in metastore, data gets left 
> around
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-43149
>                 URL: https://issues.apache.org/jira/browse/SPARK-43149
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Bruce Robbins
>            Priority: Major
>
> For example:
> {noformat}
> drop table if exists parquet_ds1;
> -- try creating table with invalid column name
> -- use 'using parquet' to designate the data source
> create table parquet_ds1 using parquet as
> select id, date'2018-01-01' + make_dt_interval(0, id)
> from range(0, 10);
> Cannot create a table having a column whose name contains commas in Hive 
> metastore. Table: `spark_catalog`.`default`.`parquet_ds1`; Column: DATE 
> '2018-01-01' + make_dt_interval(0, id, 0, 0.000000)
> -- show that table did not get created
> show tables;
> -- try again with valid column name
> -- spark will complain that directory already exists
> create table parquet_ds1 using parquet as
> select id, date'2018-01-01' + make_dt_interval(0, id) as ts
> from range(0, 10);
> [LOCATION_ALREADY_EXISTS] Cannot name the managed table as 
> `spark_catalog`.`default`.`parquet_ds1`, as its associated location 
> 'file:/Users/bruce/github/spark_upstream/spark-warehouse/parquet_ds1' already 
> exists. Please pick a different table name, or remove the existing location 
> first.
> org.apache.spark.SparkRuntimeException: [LOCATION_ALREADY_EXISTS] Cannot name 
> the managed table as `spark_catalog`.`default`.`parquet_ds1`, as its 
> associated location 
> 'file:/Users/bruce/github/spark_upstream/spark-warehouse/parquet_ds1' already 
> exists. Please pick a different table name, or remove the existing location 
> first.
>       at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.locationAlreadyExists(QueryExecutionErrors.scala:2804)
>       at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.validateTableLocation(SessionCatalog.scala:414)
>       at 
> org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176)
> ...
> {noformat}
> One must manually remove the directory {{spark-warehouse/parquet_ds1}} before 
> the {{create table}} command will succeed.
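> A minimal cleanup sketch (not part of the reproduction above): assuming a {{spark-shell}} session bound to {{spark}} and the default local warehouse, the leftover directory can be removed through the Hadoop FileSystem API so the retried CTAS no longer hits LOCATION_ALREADY_EXISTS:
> {noformat}
> import org.apache.hadoop.fs.{FileSystem, Path}
>
> // Leftover output from the failed CTAS; adjust if spark.sql.warehouse.dir is non-default.
> val leftover = new Path("spark-warehouse/parquet_ds1")
> val fs: FileSystem = leftover.getFileSystem(spark.sparkContext.hadoopConfiguration)
> if (fs.exists(leftover)) {
>   fs.delete(leftover, true)  // recursive delete of the stranded data files
> }
> {noformat}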
> It seems that datasource table creation runs the data-creation job first, 
> then stores the metadata into the metastore.
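> As an illustration of that hypothesized ordering (a simplified sketch only; {{writeDataFiles}} and {{registerInMetastore}} are stand-ins, not Spark APIs, for the work done around {{CreateDataSourceTableAsSelectCommand.run}}):
> {noformat}
> import java.nio.file.{Files, Paths}
>
> object DataSourceCtasSketch {
>   // Stand-in for the data-writing job: the table directory appears on disk first.
>   def writeDataFiles(dir: String): Unit = {
>     Files.createDirectories(Paths.get(dir))
>   }
>
>   // Stand-in for storing metadata: Hive rejects the generated column name here.
>   def registerInMetastore(table: String): Unit =
>     throw new RuntimeException(
>       s"Cannot create a table having a column whose name contains commas: $table")
>
>   def ctas(): Unit = {
>     writeDataFiles("spark-warehouse/parquet_ds1")  // step 1: run the job, directory is created
>     registerInMetastore("parquet_ds1")             // step 2: fails, and nothing removes the directory
>   }
> }
> {noformat}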
> When using Spark to create Hive tables, the issue does not happen:
> {noformat}
> drop table if exists parquet_hive1;
> -- try creating table with invalid column name,
> -- but use 'stored as parquet' instead of 'using'
> create table parquet_hive1 stored as parquet as
> select id, date'2018-01-01' + make_dt_interval(0, id)
> from range(0, 10);
> Cannot create a table having a column whose name contains commas in Hive 
> metastore. Table: `spark_catalog`.`default`.`parquet_hive1`; Column: DATE 
> '2018-01-01' + make_dt_interval(0, id, 0, 0.000000)
> -- try again with valid column name. This will succeed.
> create table parquet_hive1 stored as parquet as
> select id, date'2018-01-01' + make_dt_interval(0, id) as ts
> from range(0, 10);
> {noformat}
> It seems that Hive table creation stores metadata into the metastore first, 
> then runs the data-creation job.
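> For contrast, the same kind of sketch for the hypothesized Hive ordering (again with stand-in functions, not Spark code): because the metastore is consulted first, the failure happens before any files are written and nothing is left behind.
> {noformat}
> import java.nio.file.{Files, Paths}
>
> object HiveCtasSketch {
>   // Stand-in for storing metadata: the invalid column name is rejected up front.
>   def registerInMetastore(table: String): Unit =
>     throw new RuntimeException(
>       s"Cannot create a table having a column whose name contains commas: $table")
>
>   // Stand-in for the data-writing job.
>   def writeDataFiles(dir: String): Unit = {
>     Files.createDirectories(Paths.get(dir))
>   }
>
>   def ctas(): Unit = {
>     registerInMetastore("parquet_hive1")             // step 1: fails fast, before any data exists
>     writeDataFiles("spark-warehouse/parquet_hive1")  // step 2: never reached, so no leftover directory
>   }
> }
> {noformat}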



