navbalaraman opened a new issue, #6101:
URL: https://github.com/apache/hudi/issues/6101
### Describe the problem
I'm using a Spark job on EMR to insert data with Hudi (0.9.0). The inserts work as expected: Parquet files are written to Amazon S3, and the table is registered in the AWS Glue Data Catalog, so the data can be queried from S3 with Amazon Athena.
Now I have a use case where I need to delete some records from the dataset. I tried a Hudi delete, but the records are not deleted (no new Parquet file without the deleted records is created), and the job does not throw any error either. Any thoughts on what could be missing?
```
val spark = SparkSession.builder()
  .appName("CCPA Record Deletion")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .config("spark.hadoop.fs.s3a.multiobjectdelete.enable", "false")
  .config("spark.hadoop.fs.s3a.fast.upload", "true")
  .config("spark.sql.parquet.filterPushdown", "true")
  .config("spark.sql.parquet.mergeSchema", "false")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .config("spark.speculation", "false")
  .enableHiveSupport()
  .getOrCreate()

spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "s3.amazonaws.com")
```
Hudi configuration:
```
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.common.model.HoodieCleaningPolicy
import org.apache.hudi.config.{HoodieCompactionConfig, HoodieWriteConfig}
import org.apache.hudi.hive.MultiPartKeysValueExtractor
import org.apache.spark.sql.SaveMode

def hudioptions(): Map[String, String] =
  Map[String, String](
    HoodieWriteConfig.TABLE_NAME -> "table_name",
    DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
    DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "id_field",
    DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "partition_field",
    DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "timestamp_field",
    DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
    DataSourceWriteOptions.HIVE_URL_OPT_KEY -> "jdbc:hive2://masterdns:10000",
    DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY -> "db_name",
    DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> "table_name",
    DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "partition_field",
    DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY -> "false",
    DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY -> "true",
    DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY ->
      classOf[MultiPartKeysValueExtractor].getName,
    HoodieCompactionConfig.CLEANER_FILE_VERSIONS_RETAINED_PROP -> "1",
    HoodieCompactionConfig.CLEANER_POLICY_PROP ->
      HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS.name()
  )

val hudiOptions = hudioptions()

val dataFrameToDelete = S3DataDataFrame.where("column_name in ('a','b')")

dataFrameToDelete.write
  .format("org.apache.hudi")
  .options(hudiOptions)
  .option(DataSourceWriteOptions.OPERATION.key(),
    DataSourceWriteOptions.DELETE_PARTITION_OPERATION_OPT_VAL)
  .mode(SaveMode.Append)
  .save(S3_Path + "/")
```
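For context, in Hudi 0.9.0 the `delete` write operation removes the specific records whose keys appear in the incoming DataFrame, while `delete_partition` drops whole partitions. A minimal sketch of a record-level delete, reusing `hudiOptions`, `dataFrameToDelete`, and `S3_Path` from the snippet above (whether this matches the intended use case here is an assumption):

```scala
import org.apache.hudi.DataSourceWriteOptions
import org.apache.spark.sql.SaveMode

// Sketch only: issue a record-level delete ("delete") instead of a
// partition-level delete ("delete_partition"). Hudi resolves the records to
// delete by record key (and partition path) from dataFrameToDelete.
dataFrameToDelete.write
  .format("org.apache.hudi")
  .options(hudiOptions)
  .option(DataSourceWriteOptions.OPERATION.key(),
    DataSourceWriteOptions.DELETE_OPERATION_OPT_VAL) // "delete"
  .mode(SaveMode.Append)
  .save(S3_Path + "/")
```

This variant writes a new commit whose base files no longer contain the matched records, which is the behavior described as missing above.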
Spark and Hive configuration (EMR classifications) for AWS Glue Data Catalog support:
```
[
  {
    "Classification": "spark-hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  },
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]
```
### Additional Info
EMR 6.5, Hudi 0.9.0, Spark 3.1.2, Hive 3.1.2