atharvai opened a new issue, #7353:
URL: https://github.com/apache/hudi/issues/7353
**Describe the problem you faced**
Scenario: Create table using Spark SQL with Spark 3.2 (EMR 6.7.0) and Glue
Data Catalog for hive sync
Expected: Successful table creation and registration with hive/glue
Actual: Successful table creation and registration with hive/glue, AND an
error that the table already exists, which fails the job.
_This scenario does not fail if using DataFrameWriter instead of SQL._
This suggests the SQL writer is somehow performing the hive sync twice: the
first sync succeeds, but the second throws a "table already exists" error,
causing the job to fail.
**To Reproduce**
Steps to reproduce the behavior:
1. Run the following Spark SQL with the Glue Data Catalog configured:
```sql
CREATE TABLE {target_db}.{target_table_name} USING hudi
LOCATION 's3://{target_bucket_name}/{target_table_name}/'
OPTIONS (
  type = 'cow',
  primaryKey = '{primary_key}',
  preCombineField = '{precombine_field}',
  hoodie.table.name = '{target_table_name}',
  hoodie.datasource.write.operation = 'upsert',
  hoodie.datasource.write.table.name = '{target_table_name}',
  hoodie.datasource.write.recordkey.field = '{primary_key}',
  hoodie.datasource.write.precombine.field = '{precombine_field}',
  hoodie.datasource.write.partitionpath.field = '{partition_fields}',
  hoodie.datasource.write.keygenerator.class = 'org.apache.hudi.keygen.ComplexKeyGenerator',
  hoodie.datasource.hive_sync.enable = 'true',
  hoodie.datasource.hive_sync.mode = 'hms',
  hoodie.datasource.hive_sync.use_jdbc = 'false',
  hoodie.datasource.write.hive_style_partitioning = 'false',
  hoodie.datasource.hive_sync.partition_fields = '{partition_fields}',
  hoodie.datasource.hive_sync.database = '{target_db}',
  hoodie.datasource.hive_sync.table = '{target_table_name}',
  hoodie.write.concurrency.mode = 'optimistic_concurrency_control',
  hoodie.cleaner.policy.failed.writes = 'LAZY',
  hoodie.write.lock.provider = 'org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider',
  hoodie.write.lock.dynamodb.table = 'hudi_locks_{args.environment}',
  hoodie.write.lock.dynamodb.partition_key = '{target_table_name}',
  hoodie.write.lock.dynamodb.region = '{region}',
  hoodie.write.lock.dynamodb.billing_mode = 'PAY_PER_REQUEST',
  hoodie.write.lock.dynamodb.endpoint_url = 'dynamodb.{region}.amazonaws.com'
)
PARTITIONED BY ({partition_fields})
AS
SELECT *
FROM {source_db}.{source_table};
```
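For comparison, the report states that the same write through DataFrameWriter does not hit this error. Below is a minimal PySpark sketch of the equivalent write options; all values are hypothetical placeholders standing in for the template variables (`{target_db}`, `{primary_key}`, ...) in the SQL above, and the DynamoDB lock settings are omitted for brevity. The actual write call is commented out since it needs a live Spark session.

```python
# Hypothetical placeholder values mirroring the SQL options above.
hudi_options = {
    "hoodie.table.name": "target_table_name",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.name": "target_table_name",
    "hoodie.datasource.write.recordkey.field": "primary_key",
    "hoodie.datasource.write.precombine.field": "precombine_field",
    "hoodie.datasource.write.partitionpath.field": "partition_fields",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.hive_style_partitioning": "false",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.use_jdbc": "false",
    "hoodie.datasource.hive_sync.partition_fields": "partition_fields",
    "hoodie.datasource.hive_sync.database": "target_db",
    "hoodie.datasource.hive_sync.table": "target_table_name",
}

# With a live SparkSession this would be:
# df = spark.table("source_db.source_table")
# (df.write.format("hudi")
#    .options(**hudi_options)
#    .mode("append")
#    .save("s3://target_bucket_name/target_table_name/"))
```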
**Expected behavior**
Successful table creation and registration with hive/glue and spark job
completes with success.
**Environment Description**
[EMR 6.7.0](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/release-version-670.html)
* Hudi version : 0.11.x (both 0.11.0 and 0.11.1)
* Spark version : 3.2.1 (EMR 6.7.0)
* Hive version : 3.1.3 (EMR 6.7.0)
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
**Stacktrace**
```
Traceback (most recent call last):
  File "/tmp/spark-41761cf5-73dc-4128-adba-ba4bd8670d7c/hudi_remodeller.py", line 84, in <module>
    args.primary_key, args.precombine_field, args.partition_fields, args.region)
  File "/tmp/spark-41761cf5-73dc-4128-adba-ba4bd8670d7c/hudi_remodeller.py", line 62, in remodel_table
    spark.sql(sql)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 723, in sql
  File "/usr/lib/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.AnalysisException: Table or view '{target_table_name}' already exists in database '{target_db}'
```