stream2000 commented on PR #7627:
URL: https://github.com/apache/hudi/pull/7627#issuecomment-1552359857

    > But still, we need to solve the archiving issue.
   
   Yes, I'm trying to use the state transient time to drive the hive sync, and 
I'm running into the archiving issue. When syncing partitions to hive, it's hard 
to know whether the archive timeline contains instants with a state transient 
time larger than the last synced commit, so we would have to load the whole 
archive timeline to find them (which is unacceptable for a large archive 
timeline). 
   
   We need a fast path to find the max state transient time in the archive 
timeline (something like a checkpoint index?). When the `lastCommitTime` synced 
to hive is greater than the max state transient time in the archive timeline, we 
don't need to load the archive timeline at all. 
   
   We can implement this by adding an extra flag in some way, or simply by 
comparing against the file modification time of the last archived log. To speed 
up finding instants with a greater state transient time in the archive timeline, 
we can also compare against the modification time of the archived logs. 
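
   A rough sketch of the modification-time check, not the actual Hudi code. It 
assumes an archived log's modification time is never earlier than the state 
transient time of any instant inside it, and that `lastCommitTime` has already 
been converted to epoch millis on a comparable clock. `ArchiveFastPath`, 
`ArchivedLog`, `canSkipArchive` and `logsToScan` are hypothetical names used 
only for illustration. 

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class ArchiveFastPath {

    /** Hypothetical stand-in for one archived log file (not a Hudi class). */
    static final class ArchivedLog {
        final String path;
        final long modificationTimeMs;

        ArchivedLog(String path, long modificationTimeMs) {
            this.path = path;
            this.modificationTimeMs = modificationTimeMs;
        }
    }

    /**
     * Fast path: if the last synced time is strictly newer than the newest
     * archived log, no archived instant can have a later state transient time
     * (under the assumption stated above), so the archive need not be loaded.
     */
    static boolean canSkipArchive(long lastSyncedMs, List<ArchivedLog> logs) {
        long newest = logs.stream()
                .mapToLong(l -> l.modificationTimeMs)
                .max()
                .orElse(Long.MIN_VALUE); // no archived logs: nothing to load
        return lastSyncedMs > newest;
    }

    /** Slow path: keep only logs modified at or after the last synced time. */
    static List<ArchivedLog> logsToScan(long lastSyncedMs, List<ArchivedLog> logs) {
        return logs.stream()
                .filter(l -> l.modificationTimeMs >= lastSyncedMs)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<ArchivedLog> logs = Arrays.asList(
                new ArchivedLog("archived-log-1", 1_000L),
                new ArchivedLog("archived-log-2", 2_000L));
        System.out.println(canSkipArchive(2_500L, logs));
        System.out.println(logsToScan(1_500L, logs).size());
    }
}
```

   In a real implementation the modification times would come from the file 
system listing of the archive folder, so the check costs one listing instead of 
reading every archived log. 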
   
   What do you think? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
