Bruce Robbins created SPARK-43149:
-------------------------------------
Summary: When CREATE USING fails to store metadata in metastore,
data gets left around
Key: SPARK-43149
URL: https://issues.apache.org/jira/browse/SPARK-43149
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.5.0
Reporter: Bruce Robbins
For example:
{noformat}
drop table if exists parquet_ds1;
-- try creating table with invalid column name
-- use 'using parquet' to designate the data source
create table parquet_ds1 using parquet as
select id, date'2018-01-01' + make_dt_interval(0, id)
from range(0, 10);
Cannot create a table having a column whose name contains commas in Hive
metastore. Table: `spark_catalog`.`default`.`parquet_ds1`; Column: DATE
'2018-01-01' + make_dt_interval(0, id, 0, 0.000000)
-- show that table did not get created
show tables;
-- try again with valid column name
-- spark will complain that directory already exists
create table parquet_ds1 using parquet as
select id, date'2018-01-01' + make_dt_interval(0, id) as ts
from range(0, 10);
[LOCATION_ALREADY_EXISTS] Cannot name the managed table as
`spark_catalog`.`default`.`parquet_ds1`, as its associated location
'file:/Users/bruce/github/spark_upstream/spark-warehouse/parquet_ds1' already
exists. Please pick a different table name, or remove the existing location
first.
org.apache.spark.SparkRuntimeException: [LOCATION_ALREADY_EXISTS] Cannot name
the managed table as `spark_catalog`.`default`.`parquet_ds1`, as its associated
location 'file:/Users/bruce/github/spark_upstream/spark-warehouse/parquet_ds1'
already exists. Please pick a different table name, or remove the existing
location first.
at
org.apache.spark.sql.errors.QueryExecutionErrors$.locationAlreadyExists(QueryExecutionErrors.scala:2804)
at
org.apache.spark.sql.catalyst.catalog.SessionCatalog.validateTableLocation(SessionCatalog.scala:414)
at
org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176)
...
{noformat}
One must manually remove the directory {{spark-warehouse/parquet_ds1}} before
the {{create table}} command will succeed.
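As a one-off workaround, the orphaned directory can be removed programmatically. This is just an illustration, not part of Spark; the path assumes the default local {{spark-warehouse}} location, so adjust it for your {{spark.sql.warehouse.dir}}:

```python
import shutil

# Remove the data directory left behind by the failed CTAS.
# Path assumes the default local warehouse location; adjust as needed.
shutil.rmtree("spark-warehouse/parquet_ds1", ignore_errors=True)
```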
It seems that datasource ({{USING}}) table creation runs the data-writing job
first and only afterwards stores the metadata in the metastore, so when the
metastore rejects the table, the already-written data directory is left behind.
When Spark creates Hive-format tables, the issue does not occur:
{noformat}
drop table if exists parquet_hive1;
-- try creating table with invalid column name,
-- but use 'stored as parquet' instead of 'using'
create table parquet_hive1 stored as parquet as
select id, date'2018-01-01' + make_dt_interval(0, id)
from range(0, 10);
Cannot create a table having a column whose name contains commas in Hive
metastore. Table: `spark_catalog`.`default`.`parquet_hive1`; Column: DATE
'2018-01-01' + make_dt_interval(0, id, 0, 0.000000)
-- try again with valid column name. This will succeed.
create table parquet_hive1 stored as parquet as
select id, date'2018-01-01' + make_dt_interval(0, id) as ts
from range(0, 10);
{noformat}
It seems that Hive table creation stores the metadata in the metastore first,
then runs the data-writing job, so a metastore failure leaves no data behind.
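The difference in ordering can be sketched in a few lines. This is a minimal illustration of the two code paths, not Spark's actual implementation; {{store_metadata}}, {{create_data_first}}, and {{create_metadata_first}} are hypothetical names, and the metastore call is simulated by a column-name check:

```python
import os
import shutil
import tempfile

def store_metadata(column):
    """Simulated metastore call; rejects column names containing commas,
    mirroring the Hive metastore error in this issue."""
    if "," in column:
        raise ValueError("column name contains a comma")

def create_data_first(warehouse, table, column):
    """Datasource (CREATE ... USING) order: data job first, metadata second."""
    os.makedirs(os.path.join(warehouse, table))  # data directory is written...
    store_metadata(column)                       # ...so a failure here orphans it

def create_metadata_first(warehouse, table, column):
    """Hive (CREATE ... STORED AS) order: metadata first, data second."""
    store_metadata(column)                       # fails before any data is written
    os.makedirs(os.path.join(warehouse, table))

warehouse = tempfile.mkdtemp()
bad_column = "DATE '2018-01-01' + make_dt_interval(0, id, 0, 0.000000)"

try:
    create_data_first(warehouse, "parquet_ds1", bad_column)
except ValueError:
    pass
print(os.path.isdir(os.path.join(warehouse, "parquet_ds1")))    # True: orphaned dir

try:
    create_metadata_first(warehouse, "parquet_hive1", bad_column)
except ValueError:
    pass
print(os.path.isdir(os.path.join(warehouse, "parquet_hive1")))  # False: nothing left

shutil.rmtree(warehouse)
```

Retrying the data-first path then hits {{LOCATION_ALREADY_EXISTS}}, exactly as in the session above; the metadata-first path leaves nothing to collide with.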
--
This message was sent by Atlassian Jira
(v8.20.10#820010)