navbalaraman opened a new issue, #6101:
URL: https://github.com/apache/hudi/issues/6101
### Describe the problem
I'm using a Spark job on EMR to insert data with Hudi (0.9.0). The inserts work as expected: Parquet files are written to Amazon S3, and the table is registered in the AWS Glue Data Catalog, so the data can be queried from S3 with Amazon Athena.
Now I have a use case where I need to delete some records from the dataset. I tried a Hudi delete, but the records are not deleted (no new Parquet file without the deleted records is created), and the job does not throw any error either. Any thoughts on what could be missing?
```
val spark = SparkSession.builder()
  .appName("CCPA Record Deletion")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .config("spark.hadoop.fs.s3a.multiobjectdelete.enable", "false")
  .config("spark.hadoop.fs.s3a.fast.upload", "true")
  .config("spark.sql.parquet.filterPushdown", "true")
  .config("spark.sql.parquet.mergeSchema", "false")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .config("spark.speculation", "false")
  .enableHiveSupport()
  .getOrCreate()

spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "s3.amazonaws.com")
```
Hudi configuration:
```
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.common.model.HoodieCleaningPolicy
import org.apache.hudi.config.{HoodieCompactionConfig, HoodieWriteConfig}
import org.apache.hudi.hive.MultiPartKeysValueExtractor
import org.apache.spark.sql.SaveMode

def hudioptions(): Map[String, String] =
  Map[String, String](
    HoodieWriteConfig.TABLE_NAME -> "table_name",
    DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
    DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "id_field",
    DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "partition_field",
    DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "timestamp_field",
    DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
    DataSourceWriteOptions.HIVE_URL_OPT_KEY -> "jdbc:hive2://masterdns:10000",
    DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY -> "db_name",
    DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> "table_name",
    DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "partition_field",
    DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY -> "false",
    DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY -> "true",
    DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY ->
      classOf[MultiPartKeysValueExtractor].getName,
    HoodieCompactionConfig.CLEANER_FILE_VERSIONS_RETAINED_PROP -> "1",
    HoodieCompactionConfig.CLEANER_POLICY_PROP ->
      HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS.name()
  )

val hudiOptions = hudioptions()

val dataFrameToDelete = S3DataDataFrame.where("column_name in ('a','b')")

dataFrameToDelete.write
  .format("org.apache.hudi")
  .options(hudiOptions)
  .option(DataSourceWriteOptions.OPERATION.key(),
    DataSourceWriteOptions.DELETE_PARTITION_OPERATION_OPT_VAL)
  .mode(SaveMode.Append)
  .save(S3_Path + "/")
```
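For context, in Hudi 0.9.0 the `delete` write operation removes the specific records whose keys appear in the incoming DataFrame, while `delete_partition` drops whole partitions. A minimal sketch of a record-level delete, reusing `hudiOptions`, `dataFrameToDelete`, and `S3_Path` from the snippet above (whether this matches the intended use case here is an assumption):

```scala
import org.apache.hudi.DataSourceWriteOptions
import org.apache.spark.sql.SaveMode

// Sketch only: issue a record-level delete ("delete") instead of a
// partition-level delete ("delete_partition"). Hudi resolves the records to
// delete by record key (and partition path) from dataFrameToDelete.
dataFrameToDelete.write
  .format("org.apache.hudi")
  .options(hudiOptions)
  .option(DataSourceWriteOptions.OPERATION.key(),
    DataSourceWriteOptions.DELETE_OPERATION_OPT_VAL) // "delete"
  .mode(SaveMode.Append)
  .save(S3_Path + "/")
```

This variant writes a new commit whose base files no longer contain the matched records, which is the behavior described as missing above.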
Spark and Hive configuration (EMR classifications) for AWS Glue Data Catalog support:
```
[
  {
    "Classification": "spark-hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  },
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]
```
### Additional Info
EMR 6.5, Hudi 0.9.0, Spark 3.1.2, Hive 3.1.2