atharvai opened a new issue, #7353:
URL: https://github.com/apache/hudi/issues/7353

   **Describe the problem you faced**
   
   Scenario: Create table using Spark SQL with Spark 3.2 (EMR 6.7.0) and Glue 
Data Catalog for hive sync
   Expected: Successful table creation and registration with hive/glue
   Actual: The table is created and registered with hive/glue, but the job then fails with a "table already exists" error.
   
   _This scenario does not fail if using DataFrameWriter instead of SQL._
   
   This suggests that the SQL writer is performing the hive sync twice: the first sync succeeds, and the second throws a table-already-exists error that fails the job.
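   For comparison, the DataFrameWriter path mentioned above can be sketched as follows. This is a minimal illustration, assuming the same placeholder names (`target_db`, `target_table_name`, etc.) as in the SQL reproduction below; it is not the exact job code.

```python
# Sketch of the equivalent DataFrameWriter path, which does NOT hit the
# double hive sync. Placeholder names mirror the SQL reproduction and are
# assumptions, not verified values from the actual job.

def build_hudi_options(target_db, target_table_name, primary_key,
                       precombine_field, partition_fields):
    """Assemble the same Hudi write/sync options used in the CTAS statement."""
    return {
        "hoodie.table.name": target_table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.name": target_table_name,
        "hoodie.datasource.write.recordkey.field": primary_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_fields,
        "hoodie.datasource.write.keygenerator.class":
            "org.apache.hudi.keygen.ComplexKeyGenerator",
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.mode": "hms",
        "hoodie.datasource.hive_sync.use_jdbc": "false",
        "hoodie.datasource.hive_sync.partition_fields": partition_fields,
        "hoodie.datasource.hive_sync.database": target_db,
        "hoodie.datasource.hive_sync.table": target_table_name,
    }

# Usage (requires a live SparkSession; shown for illustration only):
# opts = build_hudi_options(target_db, target_table_name, primary_key,
#                           precombine_field, partition_fields)
# (spark.table(f"{source_db}.{source_table}")
#      .write.format("hudi")
#      .options(**opts)
#      .mode("overwrite")
#      .save(f"s3://{target_bucket_name}/{target_table_name}/"))
```

   With this path the hive sync runs once as part of the Hudi write, and the job completes without the duplicate-registration error.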
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Run following Spark SQL with Glue catalog configured
   ```sql
    CREATE TABLE {target_db}.{target_table_name} USING hudi
    LOCATION 's3://{target_bucket_name}/{target_table_name}/'
    OPTIONS (
        type = 'cow',
        primaryKey = '{primary_key}',
        preCombineField = '{precombine_field}',
        hoodie.table.name = '{target_table_name}',
        hoodie.datasource.write.operation = 'upsert',
        hoodie.datasource.write.table.name = '{target_table_name}',
        hoodie.datasource.write.recordkey.field = '{primary_key}',
        hoodie.datasource.write.precombine.field = '{precombine_field}',
        hoodie.datasource.write.partitionpath.field = '{partition_fields}',
        hoodie.datasource.write.keygenerator.class = 'org.apache.hudi.keygen.ComplexKeyGenerator',

        hoodie.datasource.hive_sync.enable = 'true',
        hoodie.datasource.hive_sync.mode = 'hms',
        hoodie.datasource.hive_sync.use_jdbc = 'false',
        hoodie.datasource.write.hive_style_partitioning = 'false',
        hoodie.datasource.hive_sync.partition_fields = '{partition_fields}',
        hoodie.datasource.hive_sync.database = '{target_db}',
        hoodie.datasource.hive_sync.table = '{target_table_name}',

        hoodie.write.concurrency.mode = 'optimistic_concurrency_control',
        hoodie.cleaner.policy.failed.writes = 'LAZY',
        hoodie.write.lock.provider = 'org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider',
        hoodie.write.lock.dynamodb.table = 'hudi_locks_{args.environment}',
        hoodie.write.lock.dynamodb.partition_key = '{target_table_name}',
        hoodie.write.lock.dynamodb.region = '{region}',
        hoodie.write.lock.dynamodb.billing_mode = 'PAY_PER_REQUEST',
        hoodie.write.lock.dynamodb.endpoint_url = 'dynamodb.{region}.amazonaws.com'
    )
    PARTITIONED BY ({partition_fields})
    AS
    SELECT *
    FROM {source_db}.{source_table};
   ```
   
   **Expected behavior**
   
   The table is created and registered with hive/glue, and the Spark job completes successfully.
   
   **Environment Description**
   [EMR 
6.7.0](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/release-version-670.html)
   
   * Hudi version : 0.11.x (both 0.11.0 and 0.11.1)
   * Spark version : 3.2.1 (EMR 6.7.0)
   * Hive version : 3.1.3 (EMR 6.7.0)
   * Hadoop version :
   * Storage (HDFS/S3/GCS..) : S3
   * Running on Docker? (yes/no) : no
   
   
   **Stacktrace**
   
   ```
   Traceback (most recent call last):
     File "/tmp/spark-41761cf5-73dc-4128-adba-ba4bd8670d7c/hudi_remodeller.py", 
line 84, in <module>
       args.primary_key, args.precombine_field, args.partition_fields, 
args.region)
     File "/tmp/spark-41761cf5-73dc-4128-adba-ba4bd8670d7c/hudi_remodeller.py", 
line 62, in remodel_table
       spark.sql(sql)
     File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 
723, in sql
     File 
"/usr/lib/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py", line 
1322, in __call__
     File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 
117, in deco
   pyspark.sql.utils.AnalysisException: Table or view '{target_table_name}' 
already exists in database '{target_db}'
   ```
   
   

