BBency opened a new issue, #9094:
URL: https://github.com/apache/hudi/issues/9094
**Problem Description**
We have a MOR table partitioned by yearmonth (yyyyMM). We would like to
trigger async clustering after the end-of-day compaction so that we can stitch
small files together into larger files. Async clustering for this table is
failing. Below are the different approaches I tried and the error messages I
got.
**Hudi Config Used**
```
"hoodie.table.name" -> hudiTableName,
"hoodie.datasource.write.keygenerator.class" ->
"org.apache.hudi.keygen.ComplexKeyGenerator",
"hoodie.datasource.write.precombine.field" -> preCombineKey,
"hoodie.datasource.write.recordkey.field" -> recordKey,
"hoodie.datasource.write.operation" -> writeOperation,
"hoodie.datasource.write.row.writer.enable" -> "true",
"hoodie.datasource.write.reconcile.schema" -> "true",
"hoodie.datasource.write.partitionpath.field" -> partitionColumnName,
"hoodie.datasource.write.hive_style_partitioning" -> "true",
"hoodie.bulkinsert.sort.mode" -> "GLOBAL_SORT",
"hoodie.datasource.hive_sync.enable" -> "true",
"hoodie.datasource.hive_sync.table" -> hudiTableName,
"hoodie.datasource.hive_sync.database" -> databaseName,
"hoodie.datasource.hive_sync.partition_fields" -> partitionColumnName,
"hoodie.datasource.hive_sync.partition_extractor_class" ->
"org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.datasource.hive_sync.use_jdbc" -> "false",
"hoodie.combine.before.upsert" -> "true",
"hoodie.index.type" -> "BLOOM",
"spark.hadoop.parquet.avro.write-old-list-structure" -> "false"
"hoodie.datasource.write.table.type" -> "MERGE_ON_READ"
"hoodie.compact.inline" -> "false",
"hoodie.compact.schedule.inline" -> "true",
"hoodie.compact.inline.trigger.strategy" -> "NUM_COMMITS",
"hoodie.compact.inline.max.delta.commits" -> "5",
"hoodie.cleaner.policy" -> "KEEP_LATEST_COMMITS",
"hoodie.cleaner.commits.retained" -> "3",
"hoodie.clustering.async.enabled" -> "true",
"hoodie.clustering.async.max.commits" -> "2",
"hoodie.clustering.execution.strategy.class" ->
"org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy",
"hoodie.clustering.plan.strategy.sort.columns" -> recordKey,
"hoodie.clustering.plan.strategy.small.file.limit" -> "67108864",
"hoodie.clustering.plan.strategy.target.file.max.bytes" -> "134217728",
"hoodie.clustering.plan.strategy.max.bytes.per.group" -> "2147483648",
"hoodie.clustering.plan.strategy.max.num.groups" -> "150",
"hoodie.clustering.preserve.commit.metadata" -> "true"
```
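For comparison, here is a minimal PySpark sketch of the same idea, an options map handed to the Hudi datasource writer. The table name, sizes, and the commented-out write call are placeholders for illustration, not taken from the original job:

```python
# Sketch only: equivalent write options expressed as a PySpark dict.
# All names/values here are illustrative placeholders.
hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.index.type": "BLOOM",
    "hoodie.clustering.async.enabled": "true",
    "hoodie.clustering.async.max.commits": "2",
    # 64 MiB small-file threshold, 128 MiB target file size
    "hoodie.clustering.plan.strategy.small.file.limit": "67108864",
    "hoodie.clustering.plan.strategy.target.file.max.bytes": "134217728",
}
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```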
**Approaches Tried**
1. Triggered a clustering job with running mode as scheduleAndExecute
**Code Used**
```
import java.util
import org.apache.hudi.utilities.HoodieClusteringJob

val hudiClusterConfig = new HoodieClusteringJob.Config
hudiClusterConfig.basePath = <table-path>
hudiClusterConfig.tableName = <table-name>
hudiClusterConfig.runningMode = "scheduleAndExecute"
hudiClusterConfig.retryLastFailedClusteringJob = true
val configList: util.List[String] = new util.ArrayList()
configList.add("hoodie.clustering.async.enabled=true")
configList.add("hoodie.clustering.async.max.commits=2")
configList.add("hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy")
configList.add("hoodie.clustering.plan.strategy.sort.columns=<sort-columns>")
configList.add("hoodie.clustering.plan.strategy.small.file.limit=67108864")
configList.add("hoodie.clustering.plan.strategy.target.file.max.bytes=134217728")
configList.add("hoodie.clustering.plan.strategy.max.bytes.per.group=2147483648")
configList.add("hoodie.clustering.plan.strategy.max.num.groups=150")
configList.add("hoodie.clustering.preserve.commit.metadata=true")
hudiClusterConfig.configs = configList
val hudiClusterJob = new HoodieClusteringJob(jsc, hudiClusterConfig)
val clusterStatus = hudiClusterJob.cluster(1)
println(clusterStatus)
```
**Stacktrace**
ShuffleMapStage 87 (sortBy at RDDCustomColumnsSortPartitioner.java:64)
failed in 1.098 s due to Job aborted due to stage failure: task 0.0 in stage
28.0 (TID 367) had a not serializable result:
org.apache.avro.generic.GenericData$Record
Serialization stack:
- object not serializable (class:
org.apache.avro.generic.GenericData$Record, value:
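As a hedged aside (not something verified in this report): a non-serializable `GenericData$Record` typically surfaces when Spark falls back to Java serialization, and the Hudi docs require the Kryo serializer on the session. The conf key/value below are standard Spark settings; whether this resolves the failure here is untested:

```python
# Hedged suggestion: Hudi requires Kryo; GenericData$Record is not
# java.io.Serializable, so Java serialization on shuffle results fails.
kryo_conf = {"spark.serializer": "org.apache.spark.serializer.KryoSerializer"}

# Applied when building the session, e.g.:
# spark = (SparkSession.builder
#          .config("spark.serializer", kryo_conf["spark.serializer"])
#          .getOrCreate())
```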
2. Used the procedure run_clustering to schedule and trigger clustering. We
found that the replacecommit created through the procedure run contained less
data than the one created when scheduled from the code in approach 1.
**Code Used**
```
query_run_clustering = f"call run_clustering(path => '{path}')"
spark_df_run_clustering = spark.sql(query_run_clustering)
spark_df_run_clustering.show()
```
**Stacktrace**
An error occurred while calling o97.sql.
: org.apache.hudi.exception.HoodieClusteringException: Clustering
failed to write to
files:c94cb139-70cf-4195-ad87-c56527ab5ccf-0,bc2c65f1-39fc-4879-ba83-5003fc9757b0-0,7e699100-39a3-46f7-ac7d-42e9cfaad2e1-0,a6076357-8a7f-4ae1-b6ec-2dd509d9818e-0,9a6752a4-1bcb-4dfb-ad82-80877d07cbdc-0,e5573f8c-c5bc-45b4-a670-1bcd9257726d-0,b00372f1-bd6d-4e46-9add-0ceca84f005a-0,6eb6bc42-b086-4aa0-a899-0b0ff602b7bf-0,35a06cda-57df-457f-aa8c-4792fd52cf33-0,78c75d85-ab08-4e97-9127-6b350d07e8f8-0,18ed0a15-9d42-495b-a43c-140b08dbc852-0,e2f5f9da-0717-4b8e-95b3-09639f2fc4a9-0,700a07e2-2114-4d50-9673-0e3dc885da55-0,1836db85-1320-4ff8-8aea-fc5dbbe267c7-0,b6c0eb8a-fd1e-40e6-bc8c-3e3b6180d916-0,225b791e-ac7b-4a6d-a295-e547c3e6a558-0,e567f6fb-bf27-496a-9c67-d26a5824870e-0,7a40f1c3-c3f5-433f-9cb8-5773de8d9557-0,b4f336b9-6669-4510-a2eb-c300fdae2320-0,1f4ef584-c199-449a-ba82-19b79531432e-0,b3b06f51-32e5-4a94-9ffe-035c08ae7f50-0,debcc1fc-8a67-4a0b-8691-d28b96c0403a-0,c40a0b32-8394-4c0c-8d41-a58e247e44c9-0,942b69c8-a292-4ba6-86a6-9c3e344a9cd6-0,80f06951-1497-4cca-861e-22addd451ddb-0,2eb68890-154a-4963-90fd-47a1a32dceaf-0,5f05cffc-7a4b-4817-8e3e-14905fd81b9b-0,1acba9bf-1ef8-40e8-8a1d-7a54ebc6387e-0,008fd3cc-987b-4855-8125-b5d0529a26a1-0,dfaf9d4c-f23e-49d4-98df-078622fb9383-0
at
org.apache.hudi.client.SparkRDDWriteClient.completeClustering(SparkRDDWriteClient.java:381)
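A possible variant of approach 2 would be invoking the procedure with explicit clustering options instead of relying on table-level configs. The `op` and `options` parameters below are assumptions, they exist in newer Hudi releases, so their availability in 0.12.1 should be checked against that version's docs; the path is a placeholder:

```python
# Hypothetical sketch: run_clustering with explicit arguments. `op` and
# `options` are assumed parameters (present in newer Hudi releases);
# verify against the 0.12.1 procedure signature before use.
path = "s3://bucket/db/table"  # placeholder base path

options = ",".join([
    "hoodie.clustering.plan.strategy.small.file.limit=67108864",
    "hoodie.clustering.plan.strategy.target.file.max.bytes=134217728",
])
query = (
    f"call run_clustering(path => '{path}', "
    f"op => 'scheduleandexecute', options => '{options}')"
)
# spark.sql(query).show()  # run inside an active Spark session
```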
Would appreciate any help/inputs.
**Expected behavior**
Clustering should succeed and stitch the smaller files together into larger files.
**Environment Description**
* Platform: AWS Glue v4.0
* Hudi version : 0.12.1
* Spark version : 3.3
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]