stream2000 commented on PR #7627:
URL: https://github.com/apache/hudi/pull/7627#issuecomment-1552359857
> But still, we need to solve the archiving issue.
Yes, I'm trying to use the state transition time to drive the Hive sync, and I'm
running into the archiving issue. When syncing partitions to Hive, it's hard to
know whether the archived timeline contains instants with a state transition
time later than the last synced commit time, so we have to load the whole
archived timeline to find them (which is unacceptable for a large archived
timeline).
We need a fast path to find the max state transition time in the archived
timeline (something like a checkpoint index?). When the `lastCommitTime` synced
to Hive is greater than the max state transition time in the archived timeline,
we don't need to load the archived timeline at all.
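
A minimal sketch of that fast-path check, in Python for brevity. The names
`needs_archived_timeline`, `last_synced_time`, and
`max_archived_transition_time` are hypothetical and not part of Hudi's API, and
timestamps are assumed to be fixed-width strings that compare lexicographically:

```python
from typing import Optional


def needs_archived_timeline(last_synced_time: str,
                            max_archived_transition_time: Optional[str]) -> bool:
    """Return True if the archived timeline must be loaded for this sync.

    Both arguments are assumed to be fixed-width timestamp strings, so plain
    string comparison orders them chronologically.
    """
    if max_archived_transition_time is None:
        # No checkpoint recorded (hypothetical case): be conservative and load.
        return True
    # Skip loading only when the last sync is strictly newer than every
    # state transition time recorded in the archived timeline.
    return last_synced_time <= max_archived_transition_time
```

With a check like this, the common case (nothing archived since the last sync)
would never touch the archived timeline.
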
We can implement this by adding an extra flag in some way, or simply by
comparing against the file modification time of the last archived log. To speed
up selecting the instants with a greater state transition time in the archived
timeline, we can also compare against the modification time of the archived
log files.
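
A rough sketch of the modification-time filter, again in Python. It assumes an
archived log is written only after the instants in it have completed, so a
file's modification time is an upper bound on the state transition times it
contains, and that the writer and storage clocks agree. `ArchivedLog` and
`logs_to_scan` are hypothetical names, not Hudi classes:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ArchivedLog:
    """Hypothetical stand-in for an archived log file's listing entry."""
    path: str
    modification_time_ms: int  # file mtime as epoch milliseconds


def logs_to_scan(logs: List[ArchivedLog],
                 last_synced_epoch_ms: int) -> List[ArchivedLog]:
    """Keep only archived logs that may hold instants newer than the last sync.

    A file last modified before the last sync cannot contain an instant whose
    state transition happened after it (under the assumption stated above).
    The comparison is inclusive to stay conservative at equal timestamps.
    """
    return [log for log in logs
            if log.modification_time_ms >= last_synced_epoch_ms]
```

Only the files returned here would need to be opened and read; the rest are
skipped from a file listing alone.
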
What do you think?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.