I agree with you @Jingsong.

Best,
Yu Zelin
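The rule agreed above (a numeric version in `VERSION AS OF` resolves to a snapshot id, any other string to a tag name, and pure numeric tag names are rejected at creation time) can be sketched as follows. This is an illustrative Python sketch, not Paimon's implementation; the function names are hypothetical.

```python
# Hypothetical sketch of the VERSION AS OF resolution rule from this thread.
# Not Paimon's actual code; names are illustrative.

def resolve_version(version) -> tuple:
    """Classify a VERSION AS OF value: numeric -> snapshot, else -> tag."""
    text = str(version)
    if text.isdigit():
        return ("snapshot", text)
    return ("tag", text)

def check_tag_name(name: str) -> None:
    """Reject pure numeric tag names so they can never shadow a snapshot id."""
    if name.isdigit():
        raise ValueError(
            f"Tag name '{name}' is ambiguous with a snapshot id"
        )

# SELECT * FROM t VERSION AS OF 1            -> snapshot #1
assert resolve_version(1) == ("snapshot", "1")
# SELECT * FROM t VERSION AS OF 'last_year'  -> tag `last_year`
assert resolve_version("last_year") == ("tag", "last_year")
```

Rejecting numeric names at tag creation, rather than at query time, keeps the resolution rule unambiguous without any extra lookup.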
> On May 30, 2023, at 16:15, Jingsong Li <[email protected]> wrote:
>
> I think we can just throw exceptions for pure numeric tag names.
>
> Iceberg's behavior looks confusing.
>
> Best,
> Jingsong
>
> On Tue, May 30, 2023 at 3:40 PM yu zelin <[email protected]> wrote:
>>
>> Hi, Shammon,
>>
>> An intuitive way is to use a numeric string to indicate a snapshot and a
>> non-numeric string to indicate a tag. For example:
>>
>> SELECT * FROM t VERSION AS OF 1            -- to snapshot #1
>> SELECT * FROM t VERSION AS OF 'last_year'  -- to tag `last_year`
>>
>> This is also how Iceberg does it [1].
>>
>> However, with this approach the tag name cannot be a numeric string. I
>> think this is acceptable and I will add it to the document.
>>
>> Best,
>> Yu Zelin
>>
>> [1] https://iceberg.apache.org/docs/latest/spark-queries/#sql
>>
>>> On May 30, 2023, at 12:17, Shammon FY <[email protected]> wrote:
>>>
>>> Hi zelin,
>>>
>>> Thanks for your update. I have one comment about Time Travel on savepoint.
>>>
>>> Currently we can use this statement in Spark for the specific snapshot 1:
>>> SELECT * FROM t VERSION AS OF 1;
>>>
>>> My point is: how can we distinguish between snapshot and savepoint when
>>> users submit a statement as follows?
>>> SELECT * FROM t VERSION AS OF <version value>;
>>>
>>> Best,
>>> Shammon FY
>>>
>>> On Tue, May 30, 2023 at 11:37 AM yu zelin <[email protected]> wrote:
>>>
>>>> Hi, Jingsong,
>>>>
>>>> Thanks for your feedback.
>>>>
>>>> ## TAG ID
>>>> It seems the id is useless currently. I'll remove it.
>>>>
>>>> ## Time Travel Syntax
>>>> Since the tag id is removed, we can just use:
>>>>
>>>> SELECT * FROM t VERSION AS OF 'tag-name'
>>>>
>>>> to travel to a tag.
>>>>
>>>> ## Tag class
>>>> I agree with you that we can reuse the Snapshot class. We can introduce
>>>> `TagManager` only to manage tags.
>>>>
>>>> ## Expiring Snapshot
>>>>> why not record it in ManifestEntry?
>>>> This is because every time Paimon generates a snapshot, it creates new
>>>> ManifestEntries for data files.
>>>> Consider this scenario: if we record it in ManifestEntry, and we commit
>>>> data file A to snapshot #1, we get manifest entry Entry#1 as
>>>> [ADD, A, commit at #1]. Then we commit -A to snapshot #2, and we get
>>>> manifest entry Entry#2 as [DELETE, A, ?]. As you can see, we cannot know
>>>> at which snapshot we committed file A. So we have to record this
>>>> information in the data file meta directly.
>>>>
>>>>> We should note that "record it in `DataFileMeta`" should be done before
>>>>> "tag" and document version compatibility.
>>>>
>>>> I will add a message for this.
>>>>
>>>> Best,
>>>> Yu Zelin
>>>>
>>>>> On May 29, 2023, at 10:29, Jingsong Li <[email protected]> wrote:
>>>>>
>>>>> Thanks Zelin for the update.
>>>>>
>>>>> ## TAG ID
>>>>>
>>>>> Is this useful? We have tag-name, snapshot-id, and now we are
>>>>> introducing a tag id? What is it used for?
>>>>>
>>>>> ## Time Travel
>>>>>
>>>>> SELECT * FROM t VERSION AS OF tag-name.<name>
>>>>>
>>>>> This does not look like SQL standard.
>>>>>
>>>>> Why do we introduce this `tag-name` prefix?
>>>>>
>>>>> ## Tag class
>>>>>
>>>>> Why not just use the Snapshot class? It looks like we don't need to
>>>>> introduce a Tag class. We can just copy the snapshot file to tag/.
>>>>>
>>>>> ## Expiring Snapshot
>>>>>
>>>>> We should note that "record it in `DataFileMeta`" should be done
>>>>> before "tag". And document version compatibility.
>>>>> And why not record it in ManifestEntry?
>>>>>
>>>>> Best,
>>>>> Jingsong
>>>>>
>>>>> On Fri, May 26, 2023 at 11:15 AM yu zelin <[email protected]> wrote:
>>>>>>
>>>>>> Hi, all,
>>>>>>
>>>>>> FYI, I have updated the PIP [1].
>>>>>>
>>>>>> Main changes:
>>>>>> - Use the new name `tag`
>>>>>> - Enrich Motivation
>>>>>> - New section `Data Files Handling` to describe how to determine
>>>>>>   whether a data file can be deleted.
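The ManifestEntry scenario above can be made concrete with a toy model. This is illustrative Python, not Paimon's real classes: the point is that the "created at" field, if stored on the (immutable) data file meta, is still visible from the later DELETE entry, whereas a field on the manifest entry itself would be unknown at DELETE time.

```python
# Toy model of the ADD/DELETE scenario from the thread (not Paimon's code).
from dataclasses import dataclass

@dataclass(frozen=True)
class DataFileMeta:
    file_name: str
    created_at_snapshot: int  # recorded once when the file is committed

@dataclass(frozen=True)
class ManifestEntry:
    kind: str  # "ADD" or "DELETE"
    file: DataFileMeta

# Commit data file A at snapshot #1 ...
meta_a = DataFileMeta("A", created_at_snapshot=1)
entry1 = ManifestEntry("ADD", meta_a)      # [ADD, A, commit at #1]

# ... then commit -A at snapshot #2. The DELETE entry reuses the file meta,
# so it still knows A was created at snapshot #1; a per-entry field would
# have to be filled with "?" here.
entry2 = ManifestEntry("DELETE", meta_a)   # [DELETE, A, created at #1]

assert entry2.file.created_at_snapshot == 1
```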
>>>>>>
>>>>>> Best,
>>>>>> Yu Zelin
>>>>>>
>>>>>> [1] https://cwiki.apache.org/confluence/x/NxE0Dw
>>>>>>
>>>>>>> On May 24, 2023, at 17:18, yu zelin <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi, Guojun,
>>>>>>>
>>>>>>> I'd like to share my thoughts about your questions.
>>>>>>>
>>>>>>> 1. Expiration of savepoint
>>>>>>> In my opinion, savepoints are created at long intervals, so there
>>>>>>> will not be too many of them. If users create a savepoint per day,
>>>>>>> there are 365 savepoints a year. So I didn't consider their
>>>>>>> expiration, and I think providing a Flink action like
>>>>>>> `delete-savepoint id = 1` is enough for now. But if it is really
>>>>>>> important, we can introduce table options to do so. I think we can
>>>>>>> do it like expiring snapshots.
>>>>>>>
>>>>>>> 2. > id of compacted snapshot picked by the savepoint
>>>>>>> My initial idea was picking a compacted snapshot or doing compaction
>>>>>>> before creating the savepoint. But after discussing with Jingsong, I
>>>>>>> found it's difficult. So now I propose to directly create the
>>>>>>> savepoint from the given snapshot. Maybe we can optimize it later.
>>>>>>> The changes will be updated soon.
>>>>>>>> manifest file list in system-table
>>>>>>> I think the manifest file is not very important for users. Users can
>>>>>>> find when a savepoint was created and get the savepoint id, then
>>>>>>> they can query it from the savepoint by the id. I didn't see in what
>>>>>>> scenario users need the manifest file information. What do you think?
>>>>>>>
>>>>>>> Best,
>>>>>>> Yu Zelin
>>>>>>>
>>>>>>>> On May 24, 2023, at 10:50, Guojun Li <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Thanks zelin for bringing up the discussion. I'm thinking about:
>>>>>>>> 1. How to manage the savepoints if there is no expiration
>>>>>>>>    mechanism: by the TTL management of storages or an external
>>>>>>>>    script?
>>>>>>>> 2.
>>>>>>>> I think the id of the compacted snapshot picked by the savepoint
>>>>>>>> and the manifest file list are also important information for
>>>>>>>> users; could this information be stored in the system-table?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Guojun
>>>>>>>>
>>>>>>>> On Mon, May 22, 2023 at 9:13 PM Jingsong Li <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> FYI
>>>>>>>>>
>>>>>>>>> The PIP lacks a table to show Discussion thread & Vote thread & ISSUE...
>>>>>>>>>
>>>>>>>>> Best
>>>>>>>>> Jingsong
>>>>>>>>>
>>>>>>>>> On Mon, May 22, 2023 at 4:48 PM yu zelin <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi, all,
>>>>>>>>>>
>>>>>>>>>> Thank you all for your suggestions and questions. After reading
>>>>>>>>>> your suggestions, I adopted some of them, and I want to share my
>>>>>>>>>> opinions here.
>>>>>>>>>>
>>>>>>>>>> To make my statements clearer, I will still use the word
>>>>>>>>>> `savepoint`. When we reach a consensus, the name may be changed.
>>>>>>>>>>
>>>>>>>>>> 1. The purposes of savepoint
>>>>>>>>>>
>>>>>>>>>> As Shammon mentioned, Flink and databases also have the concept
>>>>>>>>>> of `savepoint`, so it's better to clarify the purposes of ours.
>>>>>>>>>> Thanks to Nicholas and Jingsong; I think your explanations are
>>>>>>>>>> very clear. I'd like to give my summary:
>>>>>>>>>>
>>>>>>>>>> (1) Fault recovery (or we can say disaster recovery). Users can
>>>>>>>>>> ROLL BACK to a savepoint if needed. If a user rolls back to a
>>>>>>>>>> savepoint, the table will hold the data in the savepoint, and the
>>>>>>>>>> data committed after the savepoint will be deleted. In this
>>>>>>>>>> scenario we need savepoints because snapshots may have expired;
>>>>>>>>>> a savepoint can be kept longer and preserves the user's old data.
>>>>>>>>>>
>>>>>>>>>> (2) Record versions of data at a longer interval (typically daily
>>>>>>>>>> or weekly level). With a savepoint, users can query the old data
>>>>>>>>>> in batch mode.
>>>>>>>>>> Compared to copying records to a new table or merging incremental
>>>>>>>>>> records with old records (like using MERGE INTO in Hive), the
>>>>>>>>>> savepoint is more lightweight because we don't copy data files;
>>>>>>>>>> we just record their metadata.
>>>>>>>>>>
>>>>>>>>>> As you can see, a savepoint is very similar to a snapshot. The
>>>>>>>>>> differences are:
>>>>>>>>>>
>>>>>>>>>> (1) A savepoint lives longer. In most cases, a snapshot's
>>>>>>>>>> lifetime is about several minutes to hours. We expect a savepoint
>>>>>>>>>> to live several days, weeks, or even months.
>>>>>>>>>>
>>>>>>>>>> (2) A savepoint is mainly used for batch reading of historical
>>>>>>>>>> data. In this PIP, we don't introduce streaming reading for
>>>>>>>>>> savepoints.
>>>>>>>>>>
>>>>>>>>>> 2. Candidates for the name
>>>>>>>>>>
>>>>>>>>>> I agree with Jingsong that we can use a new name. Since the
>>>>>>>>>> purpose and mechanism of savepoint (it is very similar to
>>>>>>>>>> snapshot) are similar to `tag` in Iceberg, maybe we can use `tag`.
>>>>>>>>>>
>>>>>>>>>> In my opinion, an alternative is `anchor`. All the snapshots are
>>>>>>>>>> like the navigation path of the streaming data, and an `anchor`
>>>>>>>>>> can stop it in a place.
>>>>>>>>>>
>>>>>>>>>> 3. Public table operations and options
>>>>>>>>>>
>>>>>>>>>> We propose to expose some operations and table options for users
>>>>>>>>>> to manage savepoints.
>>>>>>>>>>
>>>>>>>>>> (1) Operations (currently for Flink)
>>>>>>>>>> We provide Flink actions to manage savepoints:
>>>>>>>>>> create-savepoint: generate a savepoint from the latest snapshot,
>>>>>>>>>>   with support for creating from a specified snapshot.
>>>>>>>>>> delete-savepoint: delete the specified savepoint.
>>>>>>>>>> rollback-to: roll back to a specified savepoint.
>>>>>>>>>>
>>>>>>>>>> (2) Table options
>>>>>>>>>> We propose to provide options for creating savepoints
>>>>>>>>>> periodically:
>>>>>>>>>> savepoint.create-time: when to create the savepoint.
>>>>>>>>>>   Example: 00:00
>>>>>>>>>> savepoint.create-interval: interval between the creation of two
>>>>>>>>>>   savepoints. Example: 2 d.
>>>>>>>>>> savepoint.time-retained: the maximum time savepoints are retained.
>>>>>>>>>>
>>>>>>>>>> (3) Procedures (future work)
>>>>>>>>>> Spark supports SQL extensions. After we support the Spark CALL
>>>>>>>>>> statement, we can provide procedures to create, delete, or roll
>>>>>>>>>> back to a savepoint for Spark users.
>>>>>>>>>>
>>>>>>>>>> Support for CALL is on the roadmap of Flink. In a future version,
>>>>>>>>>> we can also support savepoint-related procedures for Flink users.
>>>>>>>>>>
>>>>>>>>>> 4. Expiration of data files
>>>>>>>>>>
>>>>>>>>>> Currently, when a snapshot expires, data files that are not used
>>>>>>>>>> by other snapshots are deleted. After we introduce the savepoint,
>>>>>>>>>> we must make sure the data files held by savepoints will not be
>>>>>>>>>> deleted.
>>>>>>>>>>
>>>>>>>>>> Conversely, when a savepoint is deleted, the data files that are
>>>>>>>>>> not used by existing snapshots and other savepoints will be
>>>>>>>>>> deleted.
>>>>>>>>>>
>>>>>>>>>> I have written some POC code to implement this. I will update the
>>>>>>>>>> mechanism in the PIP soon.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Yu Zelin
>>>>>>>>>>
>>>>>>>>>>> On May 21, 2023, at 20:54, Jingsong Li <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Thanks Yun for your information.
>>>>>>>>>>>
>>>>>>>>>>> We need to be careful to avoid confusion between Paimon and
>>>>>>>>>>> Flink concepts regarding "savepoint".
>>>>>>>>>>>
>>>>>>>>>>> Maybe we don't have to insist on using "savepoint"; for example,
>>>>>>>>>>> TAG is also a candidate, just like in Iceberg [1].
>>>>>>>>>>>
>>>>>>>>>>> [1] https://iceberg.apache.org/docs/latest/branching/
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Jingsong
>>>>>>>>>>>
>>>>>>>>>>> On Sun, May 21, 2023 at 8:51 PM Jingsong Li <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks Nicholas for your detailed requirements.
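The data-file expiration rule described in point 4 above (a file may be deleted only if no live snapshot and no savepoint still references it) can be sketched as a simple reachability check. This is an illustrative Python sketch with assumed names, not Paimon's actual API.

```python
# Hypothetical sketch of the expiration rule from "4. Expiration of data
# files": a file from an expiring snapshot is deletable only when nothing
# else (live snapshots or savepoints/tags) still references it.

def deletable_files(expiring_files, live_snapshots, savepoints):
    """expiring_files: set of files referenced by the snapshot being expired.
    live_snapshots / savepoints: iterables of file sets still in use.
    Returns the subset of expiring_files that is safe to delete."""
    still_used = set()
    for file_set in list(live_snapshots) + list(savepoints):
        still_used |= set(file_set)
    return {f for f in expiring_files if f not in still_used}

# Expire snapshot #1 which used {a, b}; snapshot #2 still uses {b} and a
# savepoint pins {a}: nothing can be deleted yet.
assert deletable_files({"a", "b"}, [{"b"}], [{"a"}]) == set()
# Once the savepoint is gone, {a} becomes deletable.
assert deletable_files({"a", "b"}, [{"b"}], []) == {"a"}
```

The same check runs in the other direction when a savepoint is deleted: its files are deletable only if no live snapshot or remaining savepoint uses them.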
>>>>>>>>>>>>
>>>>>>>>>>>> We need to supplement user requirements in the FLIP, which is
>>>>>>>>>>>> mainly aimed at two purposes:
>>>>>>>>>>>> 1. Fault recovery for data errors (named: restore or rollback-to)
>>>>>>>>>>>> 2. Recording versions at the day level, targeting batch queries
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Jingsong
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 20, 2023 at 2:55 PM Yun Tang <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Guys,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Since we use Paimon with Flink in most cases, I think we need
>>>>>>>>>>>>> to disambiguate the same word "savepoint" in the different
>>>>>>>>>>>>> systems.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For Flink, savepoint means:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. Triggered by users, not periodically triggered by the
>>>>>>>>>>>>>    system itself. However, this FLIP wants to support creating
>>>>>>>>>>>>>    it periodically.
>>>>>>>>>>>>> 2. Even the so-called incremental native savepoint [1] will
>>>>>>>>>>>>>    not depend on previous checkpoints or savepoints; it will
>>>>>>>>>>>>>    still copy files on DFS to the self-contained savepoint
>>>>>>>>>>>>>    folder. However, from the description in this FLIP about
>>>>>>>>>>>>>    the deletion of expired snapshot files, a Paimon savepoint
>>>>>>>>>>>>>    will refer to the previously existing files directly.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't think we need to make the semantics of Paimon totally
>>>>>>>>>>>>> the same as Flink's. However, we need to introduce a table
>>>>>>>>>>>>> that tells the difference compared with Flink and discuss the
>>>>>>>>>>>>> differences.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-203%3A+Incremental+savepoints#FLIP203:Incrementalsavepoints-Semantic
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best
>>>>>>>>>>>>> Yun Tang
>>>>>>>>>>>>> ________________________________
>>>>>>>>>>>>> From: Nicholas Jiang <[email protected]>
>>>>>>>>>>>>> Sent: Friday, May 19, 2023 17:40
>>>>>>>>>>>>> To: [email protected] <[email protected]>
>>>>>>>>>>>>> Subject: Re: [DISCUSS] PIP-4 Support savepoint
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Guys,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks Zelin for driving the savepoint proposal. I have some
>>>>>>>>>>>>> opinions on savepoint:
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- About "introduce savepoint for Paimon to persist full data
>>>>>>>>>>>>> at a time point"
>>>>>>>>>>>>>
>>>>>>>>>>>>> The motivation of the savepoint proposal is more like snapshot
>>>>>>>>>>>>> TTL management. Actually, disaster recovery is mission
>>>>>>>>>>>>> critical for any software. Especially when it comes to data
>>>>>>>>>>>>> systems, the impact could be very serious, leading to delayed
>>>>>>>>>>>>> or even wrong business decisions at times. Savepoint is
>>>>>>>>>>>>> proposed to assist users in recovering data from a previous
>>>>>>>>>>>>> state: "savepoint" and "restore".
>>>>>>>>>>>>>
>>>>>>>>>>>>> "savepoint" saves the Paimon table as of the commit time;
>>>>>>>>>>>>> therefore, if there is a savepoint, the data generated in the
>>>>>>>>>>>>> corresponding commit cannot be cleaned. Meanwhile, a savepoint
>>>>>>>>>>>>> lets users restore the table to this savepoint at a later
>>>>>>>>>>>>> point in time if need be. On similar lines, a savepoint cannot
>>>>>>>>>>>>> be triggered on a commit that is already cleaned up. Savepoint
>>>>>>>>>>>>> is synonymous with taking a backup, just that we don't make a
>>>>>>>>>>>>> new copy of the table, but just save the state of the table
>>>>>>>>>>>>> elegantly so that we can restore it later when in need.
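The "save the state without copying the table" idea above can be sketched as a metadata-only copy, echoing the earlier suggestion in this thread to just copy the snapshot file to a tag/ directory. The directory layout and file names here are illustrative assumptions, not Paimon's actual layout.

```python
# Hedged sketch: creating a savepoint/tag copies only the snapshot metadata
# file; data files are shared, not duplicated. Layout is hypothetical.
import json
import os
import shutil
import tempfile

def create_tag(table_dir: str, snapshot_id: int, tag_name: str) -> str:
    """Copy the snapshot metadata file into tag/<tag_name>; no data copied."""
    src = os.path.join(table_dir, "snapshot", f"snapshot-{snapshot_id}")
    tag_dir = os.path.join(table_dir, "tag")
    os.makedirs(tag_dir, exist_ok=True)
    dst = os.path.join(tag_dir, tag_name)
    shutil.copyfile(src, dst)  # metadata only; data files stay in place
    return dst

# Demo with a throwaway table directory and a fake snapshot metadata file.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "snapshot"))
with open(os.path.join(root, "snapshot", "snapshot-7"), "w") as f:
    json.dump({"id": 7, "baseManifestList": "manifest-list-7"}, f)

tag_path = create_tag(root, 7, "daily-2023-05-24")
with open(tag_path) as f:
    assert json.load(f)["id"] == 7
```

Because only metadata is copied, creation cost is constant regardless of table size, which is what makes the savepoint "lightweight" compared with a full backup.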
>>>>>>>>>>>>>
>>>>>>>>>>>>> "restore" lets you restore your table to one of the savepoint
>>>>>>>>>>>>> commits. Meanwhile, it cannot be undone (or reversed), so care
>>>>>>>>>>>>> should be taken before doing a restore. At this time, Paimon
>>>>>>>>>>>>> would delete all data files and commit files (timeline files)
>>>>>>>>>>>>> greater than the savepoint commit to which the table is being
>>>>>>>>>>>>> restored.
>>>>>>>>>>>>>
>>>>>>>>>>>>> BTW, it would be better to introduce a snapshot view based on
>>>>>>>>>>>>> savepoint, which could improve query performance on historical
>>>>>>>>>>>>> data for a Paimon table.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- About the Public API of savepoint
>>>>>>>>>>>>>
>>>>>>>>>>>>> The currently introduced savepoint interfaces in the Public
>>>>>>>>>>>>> API are not enough for users, for example, deleteSavepoint,
>>>>>>>>>>>>> restoreSavepoint, etc.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- About "Paimon's savepoint needs to be combined with Flink's
>>>>>>>>>>>>> savepoint":
>>>>>>>>>>>>>
>>>>>>>>>>>>> If Paimon supports a savepoint mechanism and provides
>>>>>>>>>>>>> savepoint interfaces, the integration with Flink's savepoint
>>>>>>>>>>>>> is not blocked by this proposal.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In summary, savepoint is not only used to improve the query
>>>>>>>>>>>>> performance of historical data, but also for disaster recovery
>>>>>>>>>>>>> processing.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2023/05/17 09:53:11 Jingsong Li wrote:
>>>>>>>>>>>>>> What Shammon mentioned is interesting. I agree with what he
>>>>>>>>>>>>>> said about the differences in savepoints between databases
>>>>>>>>>>>>>> and stream computing.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> About "Paimon's savepoint needs to be combined with Flink's
>>>>>>>>>>>>>> savepoint":
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think it is possible, but we may need to deal with this in
>>>>>>>>>>>>>> another mechanism, because the snapshots after a savepoint
>>>>>>>>>>>>>> may expire.
>>>>>>>>>>>>>> We need to compare data between two savepoints to generate
>>>>>>>>>>>>>> incremental data for streaming reads.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> But this may not need to block the FLIP; it looks like the
>>>>>>>>>>>>>> current design does not break the future combination?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Jingsong
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, May 17, 2023 at 5:33 PM Shammon FY <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Caizhi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for your comments. As you mentioned, I think we may
>>>>>>>>>>>>>>> need to discuss the role of savepoint in Paimon.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If I understand correctly, the main feature of savepoint in
>>>>>>>>>>>>>>> the current PIP is that the savepoint will not expire, and
>>>>>>>>>>>>>>> users can perform a query on the savepoint via time travel.
>>>>>>>>>>>>>>> Besides that, there is savepoint in databases and in Flink.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. Savepoint in databases. A database can roll back table
>>>>>>>>>>>>>>> data to the specified 'version' based on a savepoint. So the
>>>>>>>>>>>>>>> key point of savepoint in a database is rolling back data.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2. Savepoint in Flink. Users can trigger a savepoint with a
>>>>>>>>>>>>>>> specific 'path' and save all state data to the savepoint for
>>>>>>>>>>>>>>> a job. Then users can create a new job based on the
>>>>>>>>>>>>>>> savepoint to continue consuming incremental data. I think
>>>>>>>>>>>>>>> the core capabilities are: backing up a job, and resuming a
>>>>>>>>>>>>>>> job based on the savepoint.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In addition to the above, Paimon may also face data write
>>>>>>>>>>>>>>> corruption and need to recover data based on the specified
>>>>>>>>>>>>>>> savepoint.
>>>>>>>>>>>>>>> So we may need to consider what abilities Paimon savepoint
>>>>>>>>>>>>>>> needs besides the ones mentioned in the current PIP.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Additionally, as mentioned above, Flink also has a savepoint
>>>>>>>>>>>>>>> mechanism. During the process of streaming data from Flink
>>>>>>>>>>>>>>> to Paimon, does Paimon's savepoint need to be combined with
>>>>>>>>>>>>>>> Flink's savepoint?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Shammon FY
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, May 17, 2023 at 4:02 PM Caizhi Weng <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi developers!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks Zelin for bringing up the discussion. The proposal
>>>>>>>>>>>>>>>> seems good to me overall. However, I'd also like to bring
>>>>>>>>>>>>>>>> up a few opinions.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. As Jingsong mentioned, the Savepoint class should not
>>>>>>>>>>>>>>>> become a public API, at least for now. What we need to
>>>>>>>>>>>>>>>> discuss for the public API is how users can create or
>>>>>>>>>>>>>>>> delete savepoints. For example, what the table option looks
>>>>>>>>>>>>>>>> like, what commands and options are provided for the Flink
>>>>>>>>>>>>>>>> action, etc.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2. Currently most Flink actions are related to streaming
>>>>>>>>>>>>>>>> processing, so only Flink can support them. However,
>>>>>>>>>>>>>>>> savepoint creation and deletion seems like a feature for
>>>>>>>>>>>>>>>> batch processing. So aside from Flink actions, shall we
>>>>>>>>>>>>>>>> also provide something like Spark actions for savepoints?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I would also like to comment on Shammon's views.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Should we introduce an option for the savepoint path,
>>>>>>>>>>>>>>>>> which may be different from 'warehouse'? Then users can
>>>>>>>>>>>>>>>>> back up the data of a savepoint.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I don't see this as necessary. To back up a table the user
>>>>>>>>>>>>>>>> just needs to copy all files from the table directory.
>>>>>>>>>>>>>>>> Savepoint in Paimon, as far as I understand, is mainly for
>>>>>>>>>>>>>>>> users to review historical data, not for backing up tables.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Will the savepoint copy data files from the snapshot or
>>>>>>>>>>>>>>>>> only save meta files?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It would be a heavy burden if a savepoint copied all its
>>>>>>>>>>>>>>>> files. As I mentioned above, savepoint is not for backing
>>>>>>>>>>>>>>>> up tables.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> How can users create a new table and restore data from the
>>>>>>>>>>>>>>>>> specified savepoint?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This reminds me of savepoints in Flink. Still, savepoint is
>>>>>>>>>>>>>>>> not for backing up tables, so I guess we don't need to
>>>>>>>>>>>>>>>> support "restoring data" from a savepoint.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, May 17, 2023 at 10:32, Shammon FY <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks Zelin for initiating this discussion. I have some
>>>>>>>>>>>>>>>>> comments:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1. Should we introduce an option for the savepoint path,
>>>>>>>>>>>>>>>>> which may be different from 'warehouse'? Then users can
>>>>>>>>>>>>>>>>> back up the data of a savepoint.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 2. Will the savepoint copy data files from the snapshot or
>>>>>>>>>>>>>>>>> only save meta files?
>>>>>>>>>>>>>>>>> The description in the PIP, "After we introduce savepoint,
>>>>>>>>>>>>>>>>> we should also check if the data files are used by
>>>>>>>>>>>>>>>>> savepoints.", suggests we only save meta files for a
>>>>>>>>>>>>>>>>> savepoint.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 3. How can users create a new table and restore data from
>>>>>>>>>>>>>>>>> the specified savepoint?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>> Shammon FY
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, May 17, 2023 at 10:19 AM Jingsong Li <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks Zelin for driving.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Some comments:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 1. I think it's possible to move `Proposed Changes` to
>>>>>>>>>>>>>>>>>> the top; the Public API has no meaning if I don't know
>>>>>>>>>>>>>>>>>> how it is done.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 2. Public API: Savepoint and SavepointManager are not
>>>>>>>>>>>>>>>>>> public API; only the Flink action or configuration
>>>>>>>>>>>>>>>>>> options should be public API.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 3. Maybe we can have a separate chapter to describe
>>>>>>>>>>>>>>>>>> `savepoint.create-interval`, maybe 'Periodic savepoint'?
>>>>>>>>>>>>>>>>>> It is not just an interval, because the true user case is
>>>>>>>>>>>>>>>>>> a savepoint after 0:00.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 4. About 'Interaction with Snapshot', to be continued ...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>> Jingsong
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tue, May 16, 2023 at 7:07 PM yu zelin <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi, Paimon Devs,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'd like to start a discussion about PIP-4 [1].
>>>>>>>>>>>>>>>>>>> In this PIP, I want to talk about why we need savepoint,
>>>>>>>>>>>>>>>>>>> and share some thoughts about managing and using
>>>>>>>>>>>>>>>>>>> savepoints. I look forward to your questions and
>>>>>>>>>>>>>>>>>>> suggestions.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>> Yu Zelin
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [1] https://cwiki.apache.org/confluence/x/NxE0Dw
