BBency opened a new issue, #9094:
URL: https://github.com/apache/hudi/issues/9094
**Problem Description**
We have a MOR table partitioned by yearmonth (yyyyMM). We would like to
trigger async clustering after the end-of-day compaction so that we can stitch
small files together into larger files. Async clustering for this table is
failing. Below are the different approaches I tried and the error messages I
got.
**Hudi Config Used**
```
"hoodie.table.name" -> hudiTableName,
"hoodie.datasource.write.keygenerator.class" ->
"org.apache.hudi.keygen.ComplexKeyGenerator",
"hoodie.datasource.write.precombine.field" -> preCombineKey,
"hoodie.datasource.write.recordkey.field" -> recordKey,
"hoodie.datasource.write.operation" -> writeOperation,
"hoodie.datasource.write.row.writer.enable" -> "true",
"hoodie.datasource.write.reconcile.schema" -> "true",
"hoodie.datasource.write.partitionpath.field" -> partitionColumnName,
"hoodie.datasource.write.hive_style_partitioning" -> "true",
"hoodie.bulkinsert.sort.mode" -> "GLOBAL_SORT",
"hoodie.datasource.hive_sync.enable" -> "true",
"hoodie.datasource.hive_sync.table" -> hudiTableName,
"hoodie.datasource.hive_sync.database" -> databaseName,
"hoodie.datasource.hive_sync.partition_fields" -> partitionColumnName,
"hoodie.datasource.hive_sync.partition_extractor_class" ->
"org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.datasource.hive_sync.use_jdbc" -> "false",
"hoodie.combine.before.upsert" -> "true",
"hoodie.index.type" -> "BLOOM",
"spark.hadoop.parquet.avro.write-old-list-structure" -> "false"
"hoodie.datasource.write.table.type" -> "MERGE_ON_READ"
"hoodie.compact.inline" -> "false",
"hoodie.compact.schedule.inline" -> "true",
"hoodie.compact.inline.trigger.strategy" -> "NUM_COMMITS",
"hoodie.compact.inline.max.delta.commits" -> "5",
"hoodie.cleaner.policy" -> "KEEP_LATEST_COMMITS",
"hoodie.cleaner.commits.retained" -> "3",
"hoodie.clustering.async.enabled" -> "true",
"hoodie.clustering.async.max.commits" -> "2",
"hoodie.clustering.execution.strategy.class" ->
"org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy",
"hoodie.clustering.plan.strategy.sort.columns" -> recordKey,
"hoodie.clustering.plan.strategy.small.file.limit" -> "67108864",
"hoodie.clustering.plan.strategy.target.file.max.bytes" -> "134217728",
"hoodie.clustering.plan.strategy.max.bytes.per.group" -> "2147483648",
"hoodie.clustering.plan.strategy.max.num.groups" -> "150",
"hoodie.clustering.preserve.commit.metadata" -> "true"
```
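For comparison, here is a minimal PySpark sketch of the same idea, an options map handed to the Hudi datasource writer. The table name, sizes, and the commented-out write call are placeholders for illustration, not taken from the original job:

```python
# Sketch only: equivalent write options expressed as a PySpark dict.
# All names/values here are illustrative placeholders.
hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.index.type": "BLOOM",
    "hoodie.clustering.async.enabled": "true",
    "hoodie.clustering.async.max.commits": "2",
    # 64 MiB small-file threshold, 128 MiB target file size
    "hoodie.clustering.plan.strategy.small.file.limit": "67108864",
    "hoodie.clustering.plan.strategy.target.file.max.bytes": "134217728",
}
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```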
**Approaches Tried**
1. Triggered a clustering job with running mode as scheduleAndExecute
**Code Used**
```
import java.util
import org.apache.hudi.utilities.HoodieClusteringJob

val hudiClusterConfig = new HoodieClusteringJob.Config
hudiClusterConfig.basePath = <table-path>
hudiClusterConfig.tableName = <table-name>
hudiClusterConfig.runningMode = "scheduleAndExecute"
hudiClusterConfig.retryLastFailedClusteringJob = true
val configList: util.List[String] = new util.ArrayList()
configList.add("hoodie.clustering.async.enabled=true")
configList.add("hoodie.clustering.async.max.commits=2")
configList.add("hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy")
configList.add("hoodie.clustering.plan.strategy.sort.columns=<sort-columns>")
configList.add("hoodie.clustering.plan.strategy.small.file.limit=67108864")
configList.add("hoodie.clustering.plan.strategy.target.file.max.bytes=134217728")
configList.add("hoodie.clustering.plan.strategy.max.bytes.per.group=2147483648")
configList.add("hoodie.clustering.plan.strategy.max.num.groups=150")
configList.add("hoodie.clustering.preserve.commit.metadata=true")
hudiClusterConfig.configs = configList
val hudiClusterJob = new HoodieClusteringJob(jsc, hudiClusterConfig)
val clusterStatus = hudiClusterJob.cluster(1)
println(clusterStatus)
```
**Stacktrace**
ShuffleMapStage 87 (sortBy at RDDCustomColumnsSortPartitioner.java:64)
failed in 1.098 s due to Job aborted due to stage failure: task 0.0 in stage
28.0 (TID 367) had a not serializable result:
org.apache.avro.generic.GenericData$Record
Serialization stack:
- object not serializable (class:
org.apache.avro.generic.GenericData$Record, value:
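As a hedged aside (not something verified in this report): a non-serializable `GenericData$Record` typically surfaces when Spark falls back to Java serialization, and the Hudi docs require the Kryo serializer on the session. The conf key/value below are standard Spark settings; whether this resolves the failure here is untested:

```python
# Hedged suggestion: Hudi requires Kryo; GenericData$Record is not
# java.io.Serializable, so Java serialization on shuffle results fails.
kryo_conf = {"spark.serializer": "org.apache.spark.serializer.KryoSerializer"}

# Applied when building the session, e.g.:
# spark = (SparkSession.builder
#          .config("spark.serializer", kryo_conf["spark.serializer"])
#          .getOrCreate())
```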
2. Used the procedure run_clustering to schedule and trigger clustering. We
found that the replacecommit created through the procedure run contained less
data than the one created when scheduled from the code in approach 1.
**Code Used**
```
query_run_clustering = f"call run_clustering(path => '{path}')"
spark_df_run_clustering = spark.sql(query_run_clustering)
spark_df_run_clustering.show()
```
**Stacktrace**
An error occurred while calling o97.sql.
: org.apache.hudi.exception.HoodieClusteringException: Clustering
failed to write to
files:c94cb139-70cf-4195-ad87-c56527ab5ccf-0,bc2c65f1-39fc-4879-ba83-5003fc9757b0-0,7e699100-39a3-46f7-ac7d-42e9cfaad2e1-0,a6076357-8a7f-4ae1-b6ec-2dd509d9818e-0,9a6752a4-1bcb-4dfb-ad82-80877d07cbdc-0,e5573f8c-c5bc-45b4-a670-1bcd9257726d-0,b00372f1-bd6d-4e46-9add-0ceca84f005a-0,6eb6bc42-b086-4aa0-a899-0b0ff602b7bf-0,35a06cda-57df-457f-aa8c-4792fd52cf33-0,78c75d85-ab08-4e97-9127-6b350d07e8f8-0,18ed0a15-9d42-495b-a43c-140b08dbc852-0,e2f5f9da-0717-4b8e-95b3-09639f2fc4a9-0,700a07e2-2114-4d50-9673-0e3dc885da55-0,1836db85-1320-4ff8-8aea-fc5dbbe267c7-0,b6c0eb8a-fd1e-40e6-bc8c-3e3b6180d916-0,225b791e-ac7b-4a6d-a295-e547c3e6a558-0,e567f6fb-bf27-496a-9c67-d26a5824870e-0,7a40f1c3-c3f5-433f-9cb8-5773de8d9557-0,b4f336b9-6669-4510-a2eb-c300fdae2320-0,1f4ef584-c199-449a-ba82-19b79531432e-0,b3b06f51-32e5-4a94-9ffe-035c08ae7f50-0,debcc1fc-8a67-4a0b-8691-d28b96c0403a-0,c40a0b32-8394-4c0c-8d41-a58e247e44c9-0,942b69c8-a292-4ba6-86a6-9c3e344a9cd6-0,80f06951-1497-4cca-861e-22addd451ddb-0,2eb68890-154a-4963-90fd-47a1a32dceaf-0,5f05cffc-7a4b-4817-8e3e-14905fd81b9b-0,1acba9bf-1ef8-40e8-8a1d-7a54ebc6387e-0,008fd3cc-987b-4855-8125-b5d0529a26a1-0,dfaf9d4c-f23e-49d4-98df-078622fb9383-0
at
org.apache.hudi.client.SparkRDDWriteClient.completeClustering(SparkRDDWriteClient.java:381)
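A possible variant of approach 2 would be invoking the procedure with explicit clustering options instead of relying on table-level configs. The `op` and `options` parameters below are assumptions, they exist in newer Hudi releases, so their availability in 0.12.1 should be checked against that version's docs; the path is a placeholder:

```python
# Hypothetical sketch: run_clustering with explicit arguments. `op` and
# `options` are assumed parameters (present in newer Hudi releases);
# verify against the 0.12.1 procedure signature before use.
path = "s3://bucket/db/table"  # placeholder base path

options = ",".join([
    "hoodie.clustering.plan.strategy.small.file.limit=67108864",
    "hoodie.clustering.plan.strategy.target.file.max.bytes=134217728",
])
query = (
    f"call run_clustering(path => '{path}', "
    f"op => 'scheduleandexecute', options => '{options}')"
)
# spark.sql(query).show()  # run inside an active Spark session
```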
Would appreciate any help/inputs.
**Expected behavior**
Clustering should succeed and stitch the smaller files together into larger files.
**Environment Description**
* Platform: AWS Glue v4.0
* Hudi version : 0.12.1
* Spark version : 3.3
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]