hudi-bot opened a new issue, #14969:
URL: https://github.com/apache/hudi/issues/14969

   If a user runs hive sync only occasionally, archival may kick in and trim 
some commits between two sync runs. If partitions were added in those archived 
commits and never updated afterwards, hive sync will miss those partitions. 
   {code:java}
     // Excerpt: the branch taken when lastCommitTimeSynced is present.
     LOG.info("Last commit time synced is " + lastCommitTimeSynced.get() + ", Getting commits since then");
     return TimelineUtils.getPartitionsWritten(metaClient.getActiveTimeline().getCommitsTimeline()
         .findInstantsAfter(lastCommitTimeSynced.get(), Integer.MAX_VALUE));
   } {code}
   This happens because, for recurrent syncs, we always fetch the commits on 
the active timeline after the last synced instant, read their commit metadata, 
and collect the partitions written as part of those commits. 
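
   The gap can be reproduced with a toy model. The sketch below is not Hudi 
code: the class, the method names, and the map standing in for the active 
timeline are all hypothetical, and it assumes instant times compare as plain 
strings. It only illustrates how a partition written in an archived commit 
drops out of an incremental sync.

```java
import java.util.Arrays;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy model, not Hudi's API.
class ArchivedCommitSyncGap {

    // Stand-in for the active timeline: instant time -> partitions written by that commit.
    // Returns the union of partitions written by commits strictly after lastSynced.
    static Set<String> partitionsWrittenAfter(
            NavigableMap<String, Set<String>> activeTimeline, String lastSynced) {
        Set<String> result = new TreeSet<>();
        for (Set<String> partitions : activeTimeline.tailMap(lastSynced, false).values()) {
            result.addAll(partitions);
        }
        return result;
    }

    // Stand-in for archival: drops every commit strictly older than cutoff from the active timeline.
    static void archiveOlderThan(NavigableMap<String, Set<String>> activeTimeline, String cutoff) {
        activeTimeline.headMap(cutoff, false).clear();
    }

    public static void main(String[] args) {
        NavigableMap<String, Set<String>> timeline = new TreeMap<>();
        timeline.put("001", new TreeSet<>(Arrays.asList("2021/01")));
        timeline.put("002", new TreeSet<>(Arrays.asList("2021/02")));
        timeline.put("003", new TreeSet<>(Arrays.asList("2021/03")));
        timeline.put("004", new TreeSet<>(Arrays.asList("2021/03")));

        // Last sync happened at instant 001. Syncing now would still see 2021/02.
        System.out.println(partitionsWrittenAfter(timeline, "001")); // [2021/02, 2021/03]

        // Archival trims 001 and 002 before the next sync runs.
        archiveOlderThan(timeline, "003");

        // The next incremental sync no longer sees 2021/02.
        System.out.println(partitionsWrittenAfter(timeline, "001")); // [2021/03]
    }
}
```

   In this model, partition 2021/02 is only recoverable by listing all 
partitions, since no commit left on the active timeline mentions it.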
   
    
   
   We can add a new config to hive sync tool to override this behavior. 
   
   --sync-all-partitions 
   
   When this config is set to true, we should ignore the last synced instant 
and take the route below, which is what happens when syncing for the first time. 
   
    
   {code:java}
   if (!lastCommitTimeSynced.isPresent()) {
     LOG.info("Last commit time synced is not known, listing all partitions in " + basePath + ",FS :" + fs);
     HoodieLocalEngineContext engineContext = new HoodieLocalEngineContext(metaClient.getHadoopConf());
     return FSUtils.getAllPartitionPaths(engineContext, basePath, useFileListingFromMetadata, assumeDatePartitioning);
   } {code}
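
   One way the proposed flag could route between the two paths is sketched 
below. This is a minimal illustration, not the actual change: the class, the 
method, and the `syncAllPartitions` parameter are hypothetical, 
`java.util.Optional` is used in place of Hudi's own option type, and the two 
lookups are passed in as callbacks so the snippet stays self-contained.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch of the routing; names are illustrative, not Hudi's API.
class PartitionSyncRouting {

    static List<String> partitionsToSync(
            Optional<String> lastCommitTimeSynced,
            boolean syncAllPartitions,                      // assumed to come from --sync-all-partitions
            Supplier<List<String>> listAllPartitions,       // stands in for the full partition listing
            Function<String, List<String>> writtenSince) {  // stands in for the timeline-based lookup
        // With the flag set, ignore the last synced instant and behave like a first sync.
        if (syncAllPartitions || !lastCommitTimeSynced.isPresent()) {
            return listAllPartitions.get();
        }
        return writtenSince.apply(lastCommitTimeSynced.get());
    }

    public static void main(String[] args) {
        Supplier<List<String>> all = () -> Arrays.asList("2021/01", "2021/02", "2021/03");
        Function<String, List<String>> since = instant -> Arrays.asList("2021/03");

        // Default behavior: incremental lookup from the last synced instant.
        System.out.println(partitionsToSync(Optional.of("001"), false, all, since)); // [2021/03]
        // Flag set: full listing even though a last synced instant exists.
        System.out.println(partitionsToSync(Optional.of("001"), true, all, since));  // [2021/01, 2021/02, 2021/03]
    }
}
```

   Listing every partition on each run costs more than the incremental 
lookup, which is presumably why it would be opt-in rather than the default.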
    
   
    
   
   Ref issue: 
   
   https://github.com/apache/hudi/issues/3890
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-3068
   - Type: New Feature


