Thanks zelin, +1 to vote

On Wed, May 31, 2023 at 10:04 AM Jingsong Li <[email protected]> wrote:
> +1 to vote
>
> On Tue, May 30, 2023 at 6:22 PM yu zelin <[email protected]> wrote:
>
> > Hi, all,
> >
> > Does anyone have questions or feedback?
> >
> > I will wait a while for your reply. If not, I'd like to start a vote later.
> >
> > Best,
> > Yu Zelin
> >
> > > On May 30, 2023, at 16:19, yu zelin <[email protected]> wrote:
> > >
> > > I agree with you @Jingsong.
> > >
> > > Best,
> > > Yu Zelin
> > >
> > >> On May 30, 2023, at 16:15, Jingsong Li <[email protected]> wrote:
> > >>
> > >> I think we can just throw exceptions for pure numeric tag names.
> > >>
> > >> Iceberg's behavior looks confusing.
> > >>
> > >> Best,
> > >> Jingsong
> > >>
> > >> On Tue, May 30, 2023 at 3:40 PM yu zelin <[email protected]> wrote:
> > >>>
> > >>> Hi, Shammon,
> > >>>
> > >>> An intuitive way is to use a numeric string to indicate a snapshot and a non-numeric string to indicate a tag.
> > >>> For example:
> > >>>
> > >>> SELECT * FROM t VERSION AS OF 1 -- to snapshot #1
> > >>> SELECT * FROM t VERSION AS OF 'last_year' -- to tag `last_year`
> > >>>
> > >>> This is also how Iceberg does it [1].
> > >>>
> > >>> However, if we use this way, a tag name cannot be a numeric string. I think this is acceptable and I will add this to the document.
> > >>>
> > >>> Best,
> > >>> Yu Zelin
> > >>>
> > >>> [1] https://iceberg.apache.org/docs/latest/spark-queries/#sql
> > >>>
> > >>>> On May 30, 2023, at 12:17, Shammon FY <[email protected]> wrote:
> > >>>>
> > >>>> Hi zelin,
> > >>>>
> > >>>> Thanks for your update. I have one comment about Time Travel on savepoint.
> > >>>>
> > >>>> Currently we can use a statement in Spark to query a specific snapshot, for example snapshot 1:
> > >>>> SELECT * FROM t VERSION AS OF 1;
> > >>>>
> > >>>> My point is: how can we distinguish between a snapshot and a savepoint when users submit a statement like the following:
> > >>>> SELECT * FROM t VERSION AS OF <version value>;
> > >>>>
> > >>>> Best,
> > >>>> Shammon FY
> > >>>>
> > >>>> On Tue, May 30, 2023 at 11:37 AM yu zelin <[email protected]> wrote:
> > >>>>
> > >>>>> Hi, Jingsong,
> > >>>>>
> > >>>>> Thanks for your feedback.
> > >>>>>
> > >>>>> ## TAG ID
> > >>>>> It seems the id is useless currently. I'll remove it.
> > >>>>>
> > >>>>> ## Time Travel Syntax
> > >>>>> Since tag id is removed, we can just use:
> > >>>>>
> > >>>>> SELECT * FROM t VERSION AS OF 'tag-name'
> > >>>>>
> > >>>>> to travel to a tag.
> > >>>>>
> > >>>>> ## Tag class
> > >>>>> I agree with you that we can reuse the Snapshot class. We can introduce `TagManager` only to manage tags.
> > >>>>>
> > >>>>> ## Expiring Snapshot
> > >>>>>> why not record it in ManifestEntry?
> > >>>>> This is because every time Paimon generates a snapshot, it creates new ManifestEntries for data files. Consider this scenario: if we record it in ManifestEntry and commit data file A to snapshot #1, we get manifest entry Entry#1 as [ADD, A, commit at #1]. Then we commit -A to snapshot #2 and get manifest entry Entry#2 as [DELETE, A, ?]. As you can see, we cannot know at which snapshot we committed file A. So we have to record this information in the data file meta directly.
> > >>>>>
> > >>>>>> We should note that "record it in `DataFileMeta`" should be done before "tag", and document version compatibility.
> > >>>>>
> > >>>>> I will add a note about this.
> > >>>>>
> > >>>>> Best,
> > >>>>> Yu Zelin
> > >>>>>
> > >>>>>
> > >>>>>> On May 29, 2023, at 10:29, Jingsong Li <[email protected]> wrote:
> > >>>>>>
> > >>>>>> Thanks Zelin for the update.
> > >>>>>>
> > >>>>>> ## TAG ID
> > >>>>>>
> > >>>>>> Is this useful? We have tag-name, snapshot-id, and now we are introducing a tag id? What is it used for?
> > >>>>>>
> > >>>>>> ## Time Travel
> > >>>>>>
> > >>>>>> SELECT * FROM t VERSION AS OF tag-name.<name>
> > >>>>>>
> > >>>>>> This does not look like standard SQL.
> > >>>>>>
> > >>>>>> Why do we introduce this `tag-name` prefix?
> > >>>>>>
> > >>>>>> ## Tag class
> > >>>>>>
> > >>>>>> Why not just use the Snapshot class? It looks like we don't need to introduce a Tag class. We can just copy the snapshot file to tag/.
> > >>>>>>
> > >>>>>> ## Expiring Snapshot
> > >>>>>>
> > >>>>>> We should note that "record it in `DataFileMeta`" should be done before "tag". And document version compatibility.
> > >>>>>> And why not record it in ManifestEntry?
> > >>>>>>
> > >>>>>> Best,
> > >>>>>> Jingsong
> > >>>>>>
> > >>>>>> On Fri, May 26, 2023 at 11:15 AM yu zelin <[email protected]> wrote:
> > >>>>>>>
> > >>>>>>> Hi, all,
> > >>>>>>>
> > >>>>>>> FYI, I have updated the PIP [1].
> > >>>>>>>
> > >>>>>>> Main changes:
> > >>>>>>> - Use new name `tag`
> > >>>>>>> - Enrich Motivation
> > >>>>>>> - New Section `Data Files Handling` to describe how to determine whether a data file can be deleted.
> > >>>>>>>
> > >>>>>>> Best,
> > >>>>>>> Yu Zelin
> > >>>>>>>
> > >>>>>>> [1] https://cwiki.apache.org/confluence/x/NxE0Dw
> > >>>>>>>
> > >>>>>>>> On May 24, 2023, at 17:18, yu zelin <[email protected]> wrote:
> > >>>>>>>>
> > >>>>>>>> Hi, Guojun,
> > >>>>>>>>
> > >>>>>>>> I'd like to share my thoughts about your questions.
> > >>>>>>>>
> > >>>>>>>> 1. Expiration of savepoint
> > >>>>>>>> In my opinion, savepoints are created at long intervals, so there will not be too many of them.
> > >>>>>>>> If users create a savepoint per day, there are 365 savepoints a year. So I didn't consider expiration of them, and I think providing a Flink action like `delete-savepoint id = 1` is enough for now.
> > >>>>>>>> But if it is really important, we can introduce table options to do so. I think we can do it like expiring snapshots.
> > >>>>>>>>
> > >>>>>>>> 2. > id of compacted snapshot picked by the savepoint
> > >>>>>>>> My initial idea was picking a compacted snapshot or doing compaction before creating a savepoint. But after discussing with Jingsong, I found it's difficult. So now I propose to create the savepoint directly from the given snapshot. Maybe we can optimize it later.
> > >>>>>>>> The changes will be updated soon.
> > >>>>>>>>> manifest file list in system-table
> > >>>>>>>> I think the manifest file is not very important for users. Users can find when a savepoint was created and get the savepoint id, then they can query the savepoint by the id. I don't see a scenario in which users need the manifest file information. What do you think?
> > >>>>>>>>
> > >>>>>>>> Best,
> > >>>>>>>> Yu Zelin
> > >>>>>>>>
> > >>>>>>>>> On May 24, 2023, at 10:50, Guojun Li <[email protected]> wrote:
> > >>>>>>>>>
> > >>>>>>>>> Thanks zelin for bringing up the discussion. I'm thinking about:
> > >>>>>>>>> 1. How to manage the savepoints if there is no expiration mechanism: by the TTL management of storage or an external script?
> > >>>>>>>>> 2. I think the id of the compacted snapshot picked by the savepoint and the manifest file list are also important information for users. Could this information be stored in the system table?
> > >>>>>>>>>
> > >>>>>>>>> Best,
> > >>>>>>>>> Guojun
> > >>>>>>>>>
> > >>>>>>>>> On Mon, May 22, 2023 at 9:13 PM Jingsong Li <[email protected]> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> FYI
> > >>>>>>>>>>
> > >>>>>>>>>> The PIP lacks a table to show Discussion thread & Vote thread & ISSUE...
> > >>>>>>>>>>
> > >>>>>>>>>> Best
> > >>>>>>>>>> Jingsong
> > >>>>>>>>>>
> > >>>>>>>>>> On Mon, May 22, 2023 at 4:48 PM yu zelin <[email protected]> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Hi, all,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thank you all for your suggestions and questions. After reading your suggestions, I adopted some of them and I want to share my opinions here.
> > >>>>>>>>>>>
> > >>>>>>>>>>> To make my statements clearer, I will still use the word `savepoint`. When we reach a consensus, the name may be changed.
> > >>>>>>>>>>>
> > >>>>>>>>>>> 1. The purposes of savepoint
> > >>>>>>>>>>>
> > >>>>>>>>>>> As Shammon mentioned, Flink and databases also have the concept of `savepoint`. So it's better to clarify the purposes of our savepoint. Thanks to Nicholas and Jingsong, I think your explanations are very clear. I'd like to give my summary:
> > >>>>>>>>>>>
> > >>>>>>>>>>> (1) Fault recovery (or we can say disaster recovery). Users can ROLL BACK to a savepoint if needed. If a user rolls back to a savepoint, the table will hold the data in the savepoint, and the data committed after the savepoint will be deleted. In this scenario we need savepoints because snapshots may have expired; the savepoint can be kept longer and save the user's old data.
> > >>>>>>>>>>>
> > >>>>>>>>>>> (2) Record versions of data at a longer interval (typically daily level or weekly level).
> > >>>>>>>>>>> With savepoints, users can query the old data in batch mode. Compared to copying records to a new table or merging incremental records with old records (like using merge into in Hive), the savepoint is more lightweight because we don't copy data files; we just record their metadata.
> > >>>>>>>>>>>
> > >>>>>>>>>>> As you can see, savepoint is very similar to snapshot. The differences are:
> > >>>>>>>>>>>
> > >>>>>>>>>>> (1) Savepoint lives longer. In most cases, a snapshot's lifetime is about several minutes to hours. We expect a savepoint to live several days, weeks, or even months.
> > >>>>>>>>>>>
> > >>>>>>>>>>> (2) Savepoint is mainly used for batch reading of historical data. In this PIP, we don't introduce streaming reading for savepoint.
> > >>>>>>>>>>>
> > >>>>>>>>>>> 2. Candidates of name
> > >>>>>>>>>>>
> > >>>>>>>>>>> I agree with Jingsong that we can use a new name. Since the purpose and mechanism (savepoint is very similar to snapshot) of savepoint are similar to `tag` in Iceberg, maybe we can use `tag`.
> > >>>>>>>>>>>
> > >>>>>>>>>>> In my opinion, an alternative is `anchor`. All the snapshots are like the navigation path of the streaming data, and an `anchor` can stop it in a place.
> > >>>>>>>>>>>
> > >>>>>>>>>>> 3. Public table operations and options
> > >>>>>>>>>>>
> > >>>>>>>>>>> We plan to expose some operations and table options for users to manage savepoints.
> > >>>>>>>>>>>
> > >>>>>>>>>>> (1) Operations (Currently for Flink)
> > >>>>>>>>>>> We provide Flink actions to manage savepoints:
> > >>>>>>>>>>> create-savepoint: To generate a savepoint from the latest snapshot.
> > >>>>>>>>>>> Creating from a specified snapshot is also supported.
> > >>>>>>>>>>> delete-savepoint: To delete a specified savepoint.
> > >>>>>>>>>>> rollback-to: To roll back to a specified savepoint.
> > >>>>>>>>>>>
> > >>>>>>>>>>> (2) Table options
> > >>>>>>>>>>> We plan to provide options for creating savepoints periodically:
> > >>>>>>>>>>> savepoint.create-time: When to create the savepoint. Example: 00:00
> > >>>>>>>>>>> savepoint.create-interval: Interval between the creation of two savepoints. Example: 2 d.
> > >>>>>>>>>>> savepoint.time-retained: The maximum time of savepoints to retain.
> > >>>>>>>>>>>
> > >>>>>>>>>>> (3) Procedures (future work)
> > >>>>>>>>>>> Spark supports SQL extensions. After we support the Spark CALL statement, we can provide procedures to create, delete or roll back to a savepoint for Spark users.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Support of CALL is on the roadmap of Flink. In a future version, we can also support savepoint-related procedures for Flink users.
> > >>>>>>>>>>>
> > >>>>>>>>>>> 4. Expiration of data files
> > >>>>>>>>>>>
> > >>>>>>>>>>> Currently, when a snapshot is expired, data files that are not used by other snapshots are deleted. After we introduce the savepoint, we must make sure the data files saved by savepoints will not be deleted.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Conversely, when a savepoint is deleted, the data files that are not used by existing snapshots and other savepoints will be deleted.
> > >>>>>>>>>>>
> > >>>>>>>>>>> I have written some POC code to implement it. I will update the mechanism in the PIP soon.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Yu Zelin
> > >>>>>>>>>>>
> > >>>>>>>>>>>> On May 21, 2023, at 20:54, Jingsong Li <[email protected]> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Thanks Yun for your information.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> We need to be careful to avoid confusion between Paimon and Flink concepts about "savepoint".
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Maybe we don't have to insist on using this "savepoint"; for example, TAG is also a candidate, just like Iceberg [1].
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> [1] https://iceberg.apache.org/docs/latest/branching/
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Best,
> > >>>>>>>>>>>> Jingsong
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Sun, May 21, 2023 at 8:51 PM Jingsong Li <[email protected]> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Thanks Nicholas for your detailed requirements.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> We need to supplement user requirements in FLIP, which is mainly aimed at two purposes:
> > >>>>>>>>>>>>> 1. Fault recovery for data errors (named: restore or rollback-to)
> > >>>>>>>>>>>>> 2. Used to record versions at, for example, the day level, targeting batch queries
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>> Jingsong
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On Sat, May 20, 2023 at 2:55 PM Yun Tang <[email protected]> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Hi Guys,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Since we use Paimon with Flink in most cases, I think we need to distinguish the meaning of the same word "savepoint" in different systems.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> For Flink, savepoint means:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 1. Triggered by users, not periodically triggered by the system itself. However, this FLIP wants to support creating it periodically.
> > >>>>>>>>>>>>>> 2. Even the so-called incremental native savepoint [1] will not depend on previous checkpoints or savepoints; it will still copy files on DFS to the self-contained savepoint folder.
> > >>>>>>>>>>>>>> However, from the description of this FLIP about the deletion of expired snapshot files, Paimon savepoint will refer to the previously existing files directly.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I don't think we need to make the semantics of Paimon totally the same as Flink's. However, we need to introduce a table to tell the difference compared with Flink and discuss the difference.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-203%3A+Incremental+savepoints#FLIP203:Incrementalsavepoints-Semantic
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Best
> > >>>>>>>>>>>>>> Yun Tang
> > >>>>>>>>>>>>>> ________________________________
> > >>>>>>>>>>>>>> From: Nicholas Jiang <[email protected]>
> > >>>>>>>>>>>>>> Sent: Friday, May 19, 2023 17:40
> > >>>>>>>>>>>>>> To: [email protected] <[email protected]>
> > >>>>>>>>>>>>>> Subject: Re: [DISCUSS] PIP-4 Support savepoint
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Hi Guys,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks Zelin for driving the savepoint proposal. I have some opinions on savepoint:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> -- About "introduce savepoint for Paimon to persist full data in a time point"
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> The motivation of the savepoint proposal is more like snapshot TTL management. Actually, disaster recovery is mission critical for any software. Especially when it comes to data systems, the impact could be very serious, leading to delays in business decisions or even wrong business decisions at times. Savepoint is proposed to assist users in recovering data from a previous state: "savepoint" and "restore".
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> "savepoint" saves the Paimon table as of the commit time; therefore, if there is a savepoint, the data generated in the corresponding commit cannot be cleaned. Meanwhile, savepoint lets users restore the table to this savepoint at a later point in time if need be. On similar lines, savepoint cannot be triggered on a commit that is already cleaned up. Savepoint is synonymous with taking a backup, except that we don't make a new copy of the table, but just save the state of the table elegantly so that we can restore it later when needed.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> "restore" lets you restore your table to one of the savepoint commits. It cannot be undone (or reversed), so care should be taken before doing a restore. At this time, Paimon would delete all data files and commit files (timeline files) greater than the savepoint commit to which the table is being restored.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> BTW, it's better to introduce a snapshot view based on savepoint, which could improve query performance of historical data for Paimon tables.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> -- About Public API of savepoint
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> The savepoint interfaces currently introduced in the Public API are not enough for users, for example, deleteSavepoint, restoreSavepoint, etc.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> -- About "Paimon's savepoint need to be combined with Flink's savepoint":
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> If Paimon supports savepoint mechanism and provides savepoint interfaces, the integration with Flink's savepoint is not blocked for this proposal.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> In summary, savepoint is not only used to improve the query performance of historical data, but also used for disaster recovery processing.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On 2023/05/17 09:53:11 Jingsong Li wrote:
> > >>>>>>>>>>>>>>> What Shammon mentioned is interesting. I agree with what he said about the differences in savepoints between databases and stream computing.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> About "Paimon's savepoint need to be combined with Flink's savepoint":
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> I think it is possible, but we may need to deal with this in another mechanism, because the snapshots after savepoint may expire. We need to compare data between two savepoints to generate incremental data for streaming read.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> But this may not need to block FLIP, it looks like the current design does not break the future combination?
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>> Jingsong
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Wed, May 17, 2023 at 5:33 PM Shammon FY <[email protected]> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Hi Caizhi,
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Thanks for your comments. As you mentioned, I think we may need to discuss the role of savepoint in Paimon.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> If I understand correctly, the main feature of savepoint in the current PIP is that the savepoint will not be expired, and users can perform a query on the savepoint according to time-travel. Besides that, there is savepoint in the database and Flink.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> 1. Savepoint in database. The database can roll back table data to the specified 'version' based on savepoint. So the key point of savepoint in the database is to roll back data.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> 2. Savepoint in Flink. Users can trigger a savepoint with a specific 'path', and save all data of state to the savepoint for job. Then users can create a new job based on the savepoint to continue consuming incremental data. I think the core capabilities are: backup for a job, and resume a job based on the savepoint.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> In addition to the above, Paimon may also face data write corruption and need to recover data based on the specified savepoint. So we may need to consider what abilities Paimon savepoint needs besides the ones mentioned in the current PIP.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Additionally, as mentioned above, Flink also has a savepoint mechanism. During the process of streaming data from Flink to Paimon, does Paimon's savepoint need to be combined with Flink's savepoint?
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>> Shammon FY
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Wed, May 17, 2023 at 4:02 PM Caizhi Weng <[email protected]> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Hi developers!
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Thanks Zelin for bringing up the discussion. The proposal seems good to me overall. However I'd also like to bring up a few options.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> 1. As Jingsong mentioned, Savepoint class should not become a public API, at least for now. What we need to discuss for the public API is how the users can create or delete savepoints. For example, what the table option looks like, what commands and options are provided for the Flink action, etc.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> 2. Currently most Flink actions are related to streaming processing, so only Flink can support them. However, savepoint creation and deletion seems like a feature for batch processing. So aside from Flink actions, shall we also provide something like Spark actions for savepoints?
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I would also like to comment on Shammon's views.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Should we introduce an option for savepoint path which may be different from 'warehouse'? Then users can backup the data of savepoint.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I don't see this is necessary.
> > >>>>>>>>>>>>>>>>> To back up a table the user just needs to copy all files from the table directory. Savepoint in Paimon, as far as I understand, is mainly for users to review historical data, not for backing up tables.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Will the savepoint copy data files from snapshot or only save meta files?
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> It would be a heavy burden if a savepoint copies all its files. As I mentioned above, savepoint is not for backing up tables.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> How can users create a new table and restore data from the specified savepoint?
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> This reminds me of savepoints in Flink. Still, savepoint is not for backing up tables so I guess we don't need to support "restoring data" from a savepoint.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On Wed, May 17, 2023 at 10:32, Shammon FY <[email protected]> wrote:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Thanks Zelin for initiating this discussion. I have some comments:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> 1. Should we introduce an option for savepoint path which may be different from 'warehouse'? Then users can backup the data of savepoint.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> 2. Will the savepoint copy data files from snapshot or only save meta files?
> > >>>>>>>>>>>>>>>>>> The description in the PIP "After we introduce savepoint, we should also check if the data files are used by savepoints." looks like we only save meta files for savepoint.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> 3. How can users create a new table and restore data from the specified savepoint?
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>>> Shammon FY
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On Wed, May 17, 2023 at 10:19 AM Jingsong Li <[email protected]> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Thanks Zelin for driving.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Some comments:
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> 1. I think it's possible to advance `Proposed Changes` to the top, Public API has no meaning if I don't know how to do it.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> 2. Public API, Savepoint and SavepointManager are not Public API, only Flink action or configuration option should be public API.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> 3. Maybe we can have a separate chapter to describe `savepoint.create-interval`, maybe 'Periodically savepoint'? It is not just an interval, because the true user case is savepoint after 0:00.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> 4. About 'Interaction with Snapshot', to be continued ...
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>>>> Jingsong
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> On Tue, May 16, 2023 at 7:07 PM yu zelin <[email protected]> wrote:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Hi, Paimon Devs,
> > >>>>>>>>>>>>>>>>>>>> I'd like to start a discussion about PIP-4 [1]. In this PIP, I want to talk about why we need savepoint, and some thoughts about managing and using savepoint. Look forward to your questions and suggestions.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>>>>> Yu Zelin
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> [1] https://cwiki.apache.org/confluence/x/NxE0Dw
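
Editor's note: the rule the thread converges on for `VERSION AS OF` (a numeric value means a snapshot id, any other string means a tag name, and purely numeric tag names are rejected with an exception) can be sketched roughly as below. This is only an illustration under stated assumptions, not Paimon's implementation: `SnapshotRef`, `TagRef`, `validate_tag_name` and `resolve_version` are hypothetical names invented for this sketch, and the empty-name check is an extra assumption not discussed in the thread.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class SnapshotRef:
    """Hypothetical reference to a snapshot by its numeric id."""
    snapshot_id: int


@dataclass(frozen=True)
class TagRef:
    """Hypothetical reference to a tag by its name."""
    tag_name: str


def _is_pure_numeric(text: str) -> bool:
    # Only ASCII digits count as numeric here; str.isdigit() alone would
    # also accept digits from other scripts.
    return text.isascii() and text.isdigit()


def validate_tag_name(name: str) -> str:
    """Reject names that would be ambiguous with snapshot ids.

    Mirrors the suggestion in the thread to throw an exception for
    purely numeric tag names. Rejecting blank names is an assumption.
    """
    if not name.strip():
        raise ValueError("tag name must not be blank")
    if _is_pure_numeric(name):
        raise ValueError(f"tag name must not be purely numeric: {name!r}")
    return name


def resolve_version(value: Union[int, str]) -> Union[SnapshotRef, TagRef]:
    """Map the value of `VERSION AS OF <value>` to a snapshot or a tag."""
    text = str(value).strip()
    if _is_pure_numeric(text):
        return SnapshotRef(int(text))
    return TagRef(validate_tag_name(text))


if __name__ == "__main__":
    print(resolve_version(1))            # resolves to a snapshot reference
    print(resolve_version("last_year"))  # resolves to a tag reference
```

Because tag creation would go through the same `validate_tag_name` check, a name such as `2023` could never be created as a tag, so the lookup stays unambiguous; a date-like name such as `2023-05-30` is still accepted because it is not purely numeric.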
