perfectcw opened a new issue, #7570:
URL: https://github.com/apache/hudi/issues/7570

   **Describe the problem you faced**
   
   Issue:
   Some partitions are lost when syncing to Hive.
   
   Background: 
   We have a data ingestion pipeline that ingests about 500 partitions per day. The pipeline submits multiple commits at the same time to insert into different partitions. Hive sync is enabled for each commit.
   
   _**After all of the commits succeeded, we found that some partitions were missing from the Hive table.**_
   
   The following is an analysis of the logs and the hoodie timeline files.
   The hoodie files show six of the commits. Only two of them, _20221227042858342_ and _20221227042906103_, were synced to Hive; the partitions written by the remaining commits did not appear in the Hive table.
   
   I think the root cause is the Hive sync mechanism. When Hudi syncs to Hive after a commit succeeds, it first gets the latest synced commit and uses that commit's timestamp as a baseline: it checks only the commits after that timestamp for new columns and partitions, and syncs whatever it finds to Hive.
   So if commit A is submitted before the latest synced commit B but succeeds after commit B, commit A is never synced to Hive. Because commit A's timestamp < commit B's timestamp, its partitions are not detected.
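
The ordering problem described above can be modeled with a small simulation. This is only a sketch of the behavior as I understand it, not Hudi's actual code: `Commit`, `COMMITS`, and `simulate_sync` are hypothetical names, and the rule "scan only completed commits whose instant time is greater than the last synced instant" is an assumption taken from the description in this issue. The instants, completion times, and partitions are copied from the file listing in this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    instant: str       # Hudi instant time, assigned when the commit is requested
    completed_at: str  # wall-clock time of the .commit file (CST), from this issue
    partition: str     # partition written by the commit


COMMITS = [
    Commit("20221227042855832", "12:30:23", "20221227/0/20230101"),
    Commit("20221227042858342", "12:29:46", "20221227/0/20230106"),
    Commit("20221227042858801", "12:30:04", "20221227/0/20230107"),
    Commit("20221227042859357", "12:30:14", "20221227/0/20221229"),
    Commit("20221227042901993", "12:30:33", "20221227/0/20230103"),
    Commit("20221227042906103", "12:29:54", "20221227/0/20230108"),
]


def simulate_sync(commits, last_synced=""):
    """Run one sync per commit, in completion order, and return synced partitions."""
    synced = []
    completed = []
    for commit in sorted(commits, key=lambda c: c.completed_at):
        completed.append(commit)
        # Assumed rule: only completed commits newer than the last synced
        # instant are scanned for partitions.
        new = [c for c in completed if c.instant > last_synced]
        synced.extend(c.partition for c in new)
        # The last synced instant moves to the newest completed commit.
        last_synced = max(c.instant for c in completed)
    return synced


synced = simulate_sync(COMMITS)
missing = sorted(c.partition for c in COMMITS if c.partition not in synced)
print(synced)
print(missing)
```

Under these assumptions only the partitions of 20221227042858342 and 20221227042906103 are synced, which matches what we see in the Hive table.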
   
   Here is the log of commit 20221227042859357. It shows that the latest synced commit was 20221227042906103, whose timestamp is later than that of 20221227042859357 itself. As a result, the partition inserted by commit 20221227042859357 was not detected, and the number of partitions to sync was 0.
   
   `2022-12-27 04:30:16,449 INFO hive.metastore: Opened a connection to metastore, current connections: 1
   2022-12-27 04:30:16,465 INFO hive.metastore: Connected to metastore.
   2022-12-27 04:30:16,676 INFO hive.HiveSyncTool: Syncing target hoodie table with hive table(forecast_agg_hoover_multi_publish). Hive metastore URL :jdbc:hive2://hs2.presto.stg.aws.fwmrm.net:10000/;auth=noSasl, basePath :s3a://fw1-stg-af-dip/hudi/forecast_agg_hoover_multi_publish
   2022-12-27 04:30:16,676 INFO hive.HiveSyncTool: Trying to sync hoodie table forecast_agg_hoover_multi_publish with base path s3a://fw1-stg-af-dip/hudi/forecast_agg_hoover_multi_publish of type COPY_ON_WRITE
   2022-12-27 04:30:16,815 INFO table.TableSchemaResolver: Reading schema from s3a://fw1-stg-af-dip/hudi/forecast_agg_hoover_multi_publish/20221227/0/20230108/9820ce59-03a8-4efa-8978-3c3cf61298d8-0_1-11-3890_20221227042906103.parquet
   2022-12-27 04:30:16,904 INFO s3a.S3AInputStream: Switching to Random IO seek policy
   2022-12-27 04:30:17,477 INFO hive.HiveSyncTool: No Schema difference for forecast_agg_hoover_multi_publish
   2022-12-27 04:30:17,477 INFO hive.HiveSyncTool: Schema sync complete. Syncing partitions for forecast_agg_hoover_multi_publish
   2022-12-27 04:30:17,525 INFO hive.HiveSyncTool: Last commit time synced was found to be 20221227042906103
   2022-12-27 04:30:17,525 INFO common.AbstractSyncHoodieClient: Last commit time synced is 20221227042906103, Getting commits since then
   2022-12-27 04:30:17,527 INFO hive.HiveSyncTool: Storage partitions scan complete. Found 0
   2022-12-27 04:30:17,697 INFO hive.HiveSyncTool: Sync complete for forecast_agg_hoover_multi_publish`
   
   
   `ordered by time
   name                                 type       last modify time           partition            exists in hive
   20221227042855832.commit.requested   requested  2022-12-27 pm12:28:59 CST  20221227/0/20230101  no
   20221227042858342.commit.requested   requested  2022-12-27 pm12:29:00 CST  20221227/0/20230106  yes
   20221227042858801.commit.requested   requested  2022-12-27 pm12:29:01 CST  20221227/0/20230107  no
   20221227042859357.commit.requested   requested  2022-12-27 pm12:29:01 CST  20221227/0/20221229  no
   20221227042901993.commit.requested   requested  2022-12-27 pm12:29:04 CST  20221227/0/20230103  no
   20221227042906103.commit.requested   requested  2022-12-27 pm12:29:08 CST  20221227/0/20230108  yes
   ...
   20221227042855832.inflight           inflight   2022-12-27 pm12:29:16 CST
   20221227042858342.inflight           inflight   2022-12-27 pm12:29:16 CST
   20221227042858801.inflight           inflight   2022-12-27 pm12:29:17 CST
   20221227042859357.inflight           inflight   2022-12-27 pm12:29:19 CST
   20221227042906103.inflight           inflight   2022-12-27 pm12:29:19 CST
   20221227042901993.inflight           inflight   2022-12-27 pm12:29:20 CST
   ...
   20221227042858342.commit             commit     2022-12-27 pm12:29:46 CST  20221227/0/20230106
   20221227042906103.commit             commit     2022-12-27 pm12:29:54 CST  20221227/0/20230108
   20221227042858801.commit             commit     2022-12-27 pm12:30:04 CST  20221227/0/20230107
   20221227042859357.commit             commit     2022-12-27 pm12:30:14 CST
   20221227042855832.commit             commit     2022-12-27 pm12:30:23 CST
   20221227042901993.commit             commit     2022-12-27 pm12:30:33 CST
   ...`
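
The commits at risk can be read off the listing above by checking which ones completed after a commit with a larger instant time. The following is a rough sketch of that check, assuming only the instant times and `.commit` modification times shown above; `TIMELINE` and `late_finishers` are hypothetical names, not part of Hudi.

```python
# (instant time, completion time of the .commit file in CST), from the listing above
TIMELINE = [
    ("20221227042855832", "12:30:23"),
    ("20221227042858342", "12:29:46"),
    ("20221227042858801", "12:30:04"),
    ("20221227042859357", "12:30:14"),
    ("20221227042901993", "12:30:33"),
    ("20221227042906103", "12:29:54"),
]


def late_finishers(timeline):
    """Return instants that completed after a commit with a larger instant time."""
    at_risk = []
    max_seen = ""  # largest instant among the commits completed so far
    for instant, _completed_at in sorted(timeline, key=lambda t: t[1]):
        if instant < max_seen:
            # A newer instant already completed, so a sync that starts from
            # the last synced instant would skip this commit.
            at_risk.append(instant)
        max_seen = max(max_seen, instant)
    return at_risk


print(late_finishers(TIMELINE))
```

For this timeline the check returns 20221227042858801, 20221227042859357, 20221227042855832 and 20221227042901993, which are exactly the four commits whose partitions are missing in Hive.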
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Submit multiple commits at the same time to insert into different partitions, with Hive sync enabled for each commit.
   2. Make sure the order in which the commits succeed differs from the order in which they were submitted.
   3. Check whether the Hive table has a partition for every insert.
   
   **Environment Description**
   
   * Hudi version :0.11.1
   
   * Spark version :3.2.1
   
   * Hive version :XXX
   
   * Hadoop version :3.3.2
   
   * Storage (HDFS/S3/GCS..) :S3
   
   * Running on Docker? (yes/no) :no
   
   

