perfectcw opened a new issue, #7570: URL: https://github.com/apache/hudi/issues/7570
**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

Issue: Some partitions are lost during Hive sync.

Background: We have a data ingestion pipeline that ingests about 500 partitions per day. The pipeline submits multiple commits at the same time, each inserting a different partition, and Hive sync is enabled for every commit. _**After all of the commits succeeded, we found that some partitions were missing from the Hive table.**_

The following is an analysis of the logs and the hoodie files.

The hoodie timeline files (listed after the log below) show six of the commits. Only two of them, _20221227042858342_ and _20221227042906103_, were synced to Hive; the partitions written by the other commits did not appear in the Hive table.

I think the root cause is the Hive sync mechanism. When Hudi syncs to Hive after a commit succeeds, it first reads the last synced commit time, then uses that timestamp as a baseline and checks only the commits after it for new columns and partitions, syncing whatever it finds. So if commit A is submitted before the last synced commit B but succeeds after B, commit A is never synced to Hive: A's timestamp is earlier than B's, so A is not detected.

Here is the log of commit 20221227042859357. The last synced commit it reads is 20221227042906103, which is later than 20221227042859357 itself. As a result, the partition inserted by 20221227042859357 is not detected, and the number of partitions to sync is 0.

`2022-12-27 04:30:16,449 INFO hive.metastore: Opened a connection to metastore, current connections: 1
2022-12-27 04:30:16,465 INFO hive.metastore: Connected to metastore.
2022-12-27 04:30:16,676 INFO hive.HiveSyncTool: Syncing target hoodie table with hive table(forecast_agg_hoover_multi_publish). Hive metastore URL :jdbc:hive2://hs2.presto.stg.aws.fwmrm.net:10000/;auth=noSasl, basePath :s3a://fw1-stg-af-dip/hudi/forecast_agg_hoover_multi_publish
2022-12-27 04:30:16,676 INFO hive.HiveSyncTool: Trying to sync hoodie table forecast_agg_hoover_multi_publish with base path s3a://fw1-stg-af-dip/hudi/forecast_agg_hoover_multi_publish of type COPY_ON_WRITE
2022-12-27 04:30:16,815 INFO table.TableSchemaResolver: Reading schema from s3a://fw1-stg-af-dip/hudi/forecast_agg_hoover_multi_publish/20221227/0/20230108/9820ce59-03a8-4efa-8978-3c3cf61298d8-0_1-11-3890_20221227042906103.parquet
2022-12-27 04:30:16,904 INFO s3a.S3AInputStream: Switching to Random IO seek policy
2022-12-27 04:30:17,477 INFO hive.HiveSyncTool: No Schema difference for forecast_agg_hoover_multi_publish
2022-12-27 04:30:17,477 INFO hive.HiveSyncTool: Schema sync complete. Syncing partitions for forecast_agg_hoover_multi_publish
2022-12-27 04:30:17,525 INFO hive.HiveSyncTool: Last commit time synced was found to be 20221227042906103
2022-12-27 04:30:17,525 INFO common.AbstractSyncHoodieClient: Last commit time synced is 20221227042906103, Getting commits since then
2022-12-27 04:30:17,527 INFO hive.HiveSyncTool: Storage partitions scan complete. Found 0
2022-12-27 04:30:17,697 INFO hive.HiveSyncTool: Sync complete for forecast_agg_hoover_multi_publish`

`order by time
name                                type       last modify time           partition            if exist in hive
20221227042855832.commit.requested  requested  2022-12-27 pm12:28:59 CST  20221227/0/20230101  no
20221227042858342.commit.requested  requested  2022-12-27 pm12:29:00 CST  20221227/0/20230106  yes
20221227042858801.commit.requested  requested  2022-12-27 pm12:29:01 CST  20221227/0/20230107  no
20221227042859357.commit.requested  requested  2022-12-27 pm12:29:01 CST  20221227/0/20221229  no
20221227042901993.commit.requested  requested  2022-12-27 pm12:29:04 CST  20221227/0/20230103  no
20221227042906103.commit.requested  requested  2022-12-27 pm12:29:08 CST  20221227/0/20230108  yes
...
20221227042855832.inflight          inflight   2022-12-27 pm12:29:16 CST
20221227042858342.inflight          inflight   2022-12-27 pm12:29:16 CST
20221227042858801.inflight          inflight   2022-12-27 pm12:29:17 CST
20221227042859357.inflight          inflight   2022-12-27 pm12:29:19 CST
20221227042906103.inflight          inflight   2022-12-27 pm12:29:19 CST
20221227042901993.inflight          inflight   2022-12-27 pm12:29:20 CST
...
20221227042858342.commit            commit     2022-12-27 pm12:29:46 CST  20221227/0/20230106
20221227042906103.commit            commit     2022-12-27 pm12:29:54 CST  20221227/0/20230108
20221227042858801.commit            commit     2022-12-27 pm12:30:04 CST  20221227/0/20230107
20221227042859357.commit            commit     2022-12-27 pm12:30:14 CST
20221227042855832.commit            commit     2022-12-27 pm12:30:23 CST
20221227042901993.commit            commit     2022-12-27 pm12:30:33 CST
...`

**To Reproduce**

Steps to reproduce the behavior:

1. Submit multiple commits at the same time to insert different partitions, with Hive sync enabled for each commit.
2. The commits succeed in an order different from the order in which they were submitted.
3. Check whether the Hive table has a partition for every insert.

**Expected behavior**

A clear and concise description of what you expected to happen.

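The suspected race can be illustrated with a small simulation. This is only a simplified model of the behavior described in this report, not Hudi's actual implementation: `simulate_sync` is a hypothetical helper, and it assumes that each sync looks only at completed commits whose instant time is strictly greater than the last synced commit time, then records the latest completed instant as the new last synced time. The instant times, partitions, and completion order are taken from the timeline listing in this report.

```python
# Simplified model of the suspected Hive sync race (not Hudi's real code).

# Instant time -> partition written, from the timeline listing in this report.
COMMITS = {
    "20221227042855832": "20221227/0/20230101",
    "20221227042858342": "20221227/0/20230106",
    "20221227042858801": "20221227/0/20230107",
    "20221227042859357": "20221227/0/20221229",
    "20221227042901993": "20221227/0/20230103",
    "20221227042906103": "20221227/0/20230108",
}

# Order in which the .commit files appeared (completion order).
COMPLETION_ORDER = [
    "20221227042858342",
    "20221227042906103",
    "20221227042858801",
    "20221227042859357",
    "20221227042855832",
    "20221227042901993",
]


def simulate_sync(commits, completion_order):
    """Run one sync per completed commit and return the partitions that reach Hive."""
    hive_partitions = set()
    last_synced = None  # assumed stand-in for the last synced commit time
    completed = []
    for instant in completion_order:
        completed.append(instant)
        # Assumption: only commits strictly after the last synced instant are scanned.
        # Instant times have equal length, so string comparison orders them correctly.
        candidates = [c for c in completed if last_synced is None or c > last_synced]
        hive_partitions.update(commits[c] for c in candidates)
        # Assumption: the latest completed instant becomes the new baseline.
        last_synced = max(completed)
    return hive_partitions


synced = simulate_sync(COMMITS, COMPLETION_ORDER)
missed = sorted(set(COMMITS.values()) - synced)
print("synced:", sorted(synced))
print("missed:", missed)
```

Under these assumptions, only the partitions of 20221227042858342 and 20221227042906103 are synced, which matches what we observed in the Hive table; every commit that finishes after 20221227042906103 but has an earlier instant time finds 0 partitions to sync.
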
**Environment Description**

* Hudi version : 0.11.1
* Spark version : 3.2.1
* Hive version : XXX
* Hadoop version : 3.3.2
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Additional context**

Add any other context about the problem here.

**Stacktrace**

```Add the stacktrace of the error.```

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
