Hi all,

Does anyone have questions or feedback?
I will wait a while for your replies. If there are none, I'd like to start a vote later.

Best,
Yu Zelin

> On May 30, 2023, at 16:19, yu zelin <[email protected]> wrote:
>
> I agree with you @Jingsong.
>
> Best,
> Yu Zelin
>
>> On May 30, 2023, at 16:15, Jingsong Li <[email protected]> wrote:
>>
>> I think we can just throw exceptions for pure numeric tag names.
>>
>> Iceberg's behavior looks confusing.
>>
>> Best,
>> Jingsong
>>
>> On Tue, May 30, 2023 at 3:40 PM yu zelin <[email protected]> wrote:
>>>
>>> Hi, Shammon,
>>>
>>> An intuitive way is to use a numeric string to indicate a snapshot and a
>>> non-numeric string to indicate a tag.
>>> For example:
>>>
>>> SELECT * FROM t VERSION AS OF 1            -- to snapshot #1
>>> SELECT * FROM t VERSION AS OF 'last_year'  -- to tag `last_year`
>>>
>>> This is also how Iceberg does it [1].
>>>
>>> However, with this approach, a tag name cannot be a numeric string. I think
>>> this is acceptable, and I will add it to the document.
>>>
>>> Best,
>>> Yu Zelin
>>>
>>> [1] https://iceberg.apache.org/docs/latest/spark-queries/#sql
>>>
>>>> On May 30, 2023, at 12:17, Shammon FY <[email protected]> wrote:
>>>>
>>>> Hi zelin,
>>>>
>>>> Thanks for your update. I have one comment about Time Travel on savepoint.
>>>>
>>>> Currently we can use this statement in Spark for the specific snapshot 1:
>>>> SELECT * FROM t VERSION AS OF 1;
>>>>
>>>> My point is: how can we distinguish between snapshot and savepoint when
>>>> users submit a statement as follows:
>>>> SELECT * FROM t VERSION AS OF <version value>;
>>>>
>>>> Best,
>>>> Shammon FY
>>>>
>>>> On Tue, May 30, 2023 at 11:37 AM yu zelin <[email protected]> wrote:
>>>>
>>>>> Hi, Jingsong,
>>>>>
>>>>> Thanks for your feedback.
>>>>>
>>>>> ## TAG ID
>>>>> It seems the id is useless currently. I'll remove it.
>>>>>
>>>>> ## Time Travel Syntax
>>>>> Since the tag id is removed, we can just use:
>>>>>
>>>>> SELECT * FROM t VERSION AS OF 'tag-name'
>>>>>
>>>>> to travel to a tag.
>>>>>
>>>>> ## Tag class
>>>>> I agree with you that we can reuse the Snapshot class.
>>>>> We can introduce `TagManager` only to manage tags.
>>>>>
>>>>> ## Expiring Snapshot
>>>>>> why not record it in ManifestEntry?
>>>>> This is because every time Paimon generates a snapshot, it creates new
>>>>> ManifestEntries for data files. Consider this scenario: if we record it
>>>>> in ManifestEntry and we commit data file A to snapshot #1, we will get
>>>>> manifest entry Entry#1 as [ADD, A, commit at #1]. Then we commit -A to
>>>>> snapshot #2, and we will get manifest entry Entry#2 as [DELETE, A, ?].
>>>>> As you can see, we cannot know at which snapshot we committed file A.
>>>>> So we have to record this information in the data file meta directly.
>>>>>
>>>>>> We should note that "record it in `DataFileMeta`" should be done before
>>>>>> "tag" and document version compatibility.
>>>>>
>>>>> I will add a note for this.
>>>>>
>>>>> Best,
>>>>> Yu Zelin
>>>>>
>>>>>> On May 29, 2023, at 10:29, Jingsong Li <[email protected]> wrote:
>>>>>>
>>>>>> Thanks Zelin for the update.
>>>>>>
>>>>>> ## TAG ID
>>>>>>
>>>>>> Is this useful? We have tag-name, snapshot-id, and now we are
>>>>>> introducing a tag id? What is it used for?
>>>>>>
>>>>>> ## Time Travel
>>>>>>
>>>>>> SELECT * FROM t VERSION AS OF tag-name.<name>
>>>>>>
>>>>>> This does not look like SQL standard.
>>>>>>
>>>>>> Why do we introduce this `tag-name` prefix?
>>>>>>
>>>>>> ## Tag class
>>>>>>
>>>>>> Why not just use the Snapshot class? It looks like we don't need to
>>>>>> introduce a Tag class. We can just copy the snapshot file to tag/.
>>>>>>
>>>>>> ## Expiring Snapshot
>>>>>>
>>>>>> We should note that "record it in `DataFileMeta`" should be done
>>>>>> before "tag". And document version compatibility.
>>>>>> And why not record it in ManifestEntry?
>>>>>>
>>>>>> Best,
>>>>>> Jingsong
>>>>>>
>>>>>> On Fri, May 26, 2023 at 11:15 AM yu zelin <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi, all,
>>>>>>>
>>>>>>> FYI, I have updated the PIP [1].
>>>>>>>
>>>>>>> Main changes:
>>>>>>> - Use the new name `tag`
>>>>>>> - Enrich Motivation
>>>>>>> - New section `Data Files Handling` to describe how to determine
>>>>>>>   whether a data file can be deleted
>>>>>>>
>>>>>>> Best,
>>>>>>> Yu Zelin
>>>>>>>
>>>>>>> [1] https://cwiki.apache.org/confluence/x/NxE0Dw
>>>>>>>
>>>>>>>> On May 24, 2023, at 17:18, yu zelin <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi, Guojun,
>>>>>>>>
>>>>>>>> I'd like to share my thoughts about your questions.
>>>>>>>>
>>>>>>>> 1. Expiration of savepoints
>>>>>>>> In my opinion, savepoints are created at a long interval, so there
>>>>>>>> will not be too many of them. If users create one savepoint per day,
>>>>>>>> there are 365 savepoints in a year. So I didn't consider expiring
>>>>>>>> them, and I think providing a Flink action like `delete-savepoint
>>>>>>>> id = 1` is enough for now. But if it is really important, we can
>>>>>>>> introduce table options to do so. I think we can do it like expiring
>>>>>>>> snapshots.
>>>>>>>>
>>>>>>>> 2. > id of compacted snapshot picked by the savepoint
>>>>>>>> My initial idea was to pick a compacted snapshot or do a compaction
>>>>>>>> before creating the savepoint. But after discussing with Jingsong, I
>>>>>>>> found it's difficult. So now I propose to create the savepoint
>>>>>>>> directly from the given snapshot. Maybe we can optimize it later.
>>>>>>>> The changes will be updated soon.
>>>>>>>>> manifest file list in system-table
>>>>>>>> I think the manifest file is not very important for users. Users can
>>>>>>>> find when a savepoint was created, get the savepoint id, and then
>>>>>>>> query the savepoint by that id. I didn't see in what scenario users
>>>>>>>> would need the manifest file information. What do you think?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Yu Zelin
>>>>>>>>
>>>>>>>>> On May 24, 2023, at 10:50, Guojun Li <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Thanks zelin for bringing up the discussion. I'm thinking about:
>>>>>>>>> 1.
>>>>>>>>> How should we manage the savepoints if there is no expiration
>>>>>>>>> mechanism: via the TTL management of storages, or an external script?
>>>>>>>>> 2. I think the id of the compacted snapshot picked by the savepoint
>>>>>>>>> and the manifest file list are also important information for users.
>>>>>>>>> Could this information be stored in the system-table?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Guojun
>>>>>>>>>
>>>>>>>>> On Mon, May 22, 2023 at 9:13 PM Jingsong Li <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> FYI
>>>>>>>>>>
>>>>>>>>>> The PIP lacks a table to show Discussion thread & Vote thread & ISSUE...
>>>>>>>>>>
>>>>>>>>>> Best
>>>>>>>>>> Jingsong
>>>>>>>>>>
>>>>>>>>>> On Mon, May 22, 2023 at 4:48 PM yu zelin <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi, all,
>>>>>>>>>>>
>>>>>>>>>>> Thank you all for your suggestions and questions. After reading
>>>>>>>>>>> your suggestions, I adopted some of them, and I want to share my
>>>>>>>>>>> opinions here.
>>>>>>>>>>>
>>>>>>>>>>> To make my statements clearer, I will still use the word
>>>>>>>>>>> `savepoint`. When we reach a consensus, the name may be changed.
>>>>>>>>>>>
>>>>>>>>>>> 1. The purposes of savepoint
>>>>>>>>>>>
>>>>>>>>>>> As Shammon mentioned, Flink and databases also have the concept of
>>>>>>>>>>> `savepoint`, so it's better to clarify the purposes of ours.
>>>>>>>>>>> Thanks to Nicholas and Jingsong; I think your explanations are
>>>>>>>>>>> very clear. I'd like to give my summary:
>>>>>>>>>>>
>>>>>>>>>>> (1) Fault recovery (or we can say disaster recovery). Users can
>>>>>>>>>>> ROLL BACK to a savepoint if needed. If a user rolls back to a
>>>>>>>>>>> savepoint, the table will hold the data in the savepoint, and the
>>>>>>>>>>> data committed after the savepoint will be deleted. In this
>>>>>>>>>>> scenario we need savepoints because snapshots may have expired; a
>>>>>>>>>>> savepoint can live longer and preserve the user's old data.
>>>>>>>>>>>
>>>>>>>>>>> (2) Record versions of data at a longer interval (typically daily
>>>>>>>>>>> or weekly). With savepoints, users can query the old data in batch
>>>>>>>>>>> mode. Compared to copying records to a new table or merging
>>>>>>>>>>> incremental records with old records (like using MERGE INTO in
>>>>>>>>>>> Hive), the savepoint is more lightweight because we don't copy
>>>>>>>>>>> data files; we just record their metadata.
>>>>>>>>>>>
>>>>>>>>>>> As you can see, a savepoint is very similar to a snapshot. The
>>>>>>>>>>> differences are:
>>>>>>>>>>>
>>>>>>>>>>> (1) A savepoint lives longer. In most cases, a snapshot's lifetime
>>>>>>>>>>> is several minutes to hours. We expect a savepoint to live several
>>>>>>>>>>> days, weeks, or even months.
>>>>>>>>>>>
>>>>>>>>>>> (2) A savepoint is mainly used for batch reading of historical
>>>>>>>>>>> data. In this PIP, we don't introduce streaming reading for
>>>>>>>>>>> savepoints.
>>>>>>>>>>>
>>>>>>>>>>> 2. Candidates for the name
>>>>>>>>>>>
>>>>>>>>>>> I agree with Jingsong that we can use a new name. Since the
>>>>>>>>>>> purpose and mechanism of savepoint (it is very similar to
>>>>>>>>>>> snapshot) resemble `tag` in Iceberg, maybe we can use `tag`.
>>>>>>>>>>>
>>>>>>>>>>> In my opinion, an alternative is `anchor`. All the snapshots are
>>>>>>>>>>> like the navigation path of the streaming data, and an `anchor`
>>>>>>>>>>> can pin it in place.
>>>>>>>>>>>
>>>>>>>>>>> 3. Public table operations and options
>>>>>>>>>>>
>>>>>>>>>>> We propose to expose some operations and table options for users
>>>>>>>>>>> to manage savepoints.
>>>>>>>>>>>
>>>>>>>>>>> (1) Operations (currently for Flink)
>>>>>>>>>>> We provide Flink actions to manage savepoints:
>>>>>>>>>>> create-savepoint: generate a savepoint from the latest snapshot;
>>>>>>>>>>> also supports creating one from a specified snapshot.
>>>>>>>>>>> delete-savepoint: delete a specified savepoint.
>>>>>>>>>>> rollback-to: roll back to a specified savepoint.
>>>>>>>>>>>
>>>>>>>>>>> (2) Table options
>>>>>>>>>>> We propose to provide options for creating savepoints periodically:
>>>>>>>>>>> savepoint.create-time: when to create the savepoint. Example: 00:00
>>>>>>>>>>> savepoint.create-interval: interval between the creation of two
>>>>>>>>>>> savepoints. Example: 2 d.
>>>>>>>>>>> savepoint.time-retained: the maximum time to retain savepoints.
>>>>>>>>>>>
>>>>>>>>>>> (3) Procedures (future work)
>>>>>>>>>>> Spark supports SQL extensions. After we support the Spark CALL
>>>>>>>>>>> statement, we can provide procedures to create, delete, or roll
>>>>>>>>>>> back to a savepoint for Spark users.
>>>>>>>>>>>
>>>>>>>>>>> Support for CALL is on the roadmap of Flink. In a future version,
>>>>>>>>>>> we can also support savepoint-related procedures for Flink users.
>>>>>>>>>>>
>>>>>>>>>>> 4. Expiration of data files
>>>>>>>>>>>
>>>>>>>>>>> Currently, when a snapshot expires, data files that are not used
>>>>>>>>>>> by other snapshots are deleted. After we introduce savepoints, we
>>>>>>>>>>> must make sure the data files referenced by savepoints will not be
>>>>>>>>>>> deleted.
>>>>>>>>>>>
>>>>>>>>>>> Conversely, when a savepoint is deleted, the data files that are
>>>>>>>>>>> not used by existing snapshots and other savepoints will be
>>>>>>>>>>> deleted.
>>>>>>>>>>>
>>>>>>>>>>> I have written some POC code to implement this. I will update the
>>>>>>>>>>> mechanism in the PIP soon.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Yu Zelin
>>>>>>>>>>>
>>>>>>>>>>>> On May 21, 2023, at 20:54, Jingsong Li <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks Yun for your information.
>>>>>>>>>>>>
>>>>>>>>>>>> We need to be careful to avoid confusion between the Paimon and
>>>>>>>>>>>> Flink concepts of "savepoint".
>>>>>>>>>>>>
>>>>>>>>>>>> Maybe we don't have to insist on using "savepoint"; for example,
>>>>>>>>>>>> TAG is also a candidate, just like in Iceberg [1].
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://iceberg.apache.org/docs/latest/branching/
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Jingsong
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, May 21, 2023 at 8:51 PM Jingsong Li <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks Nicholas for your detailed requirements.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We need to supplement user requirements in the FLIP, which is
>>>>>>>>>>>>> mainly aimed at two purposes:
>>>>>>>>>>>>> 1. Fault recovery for data errors (named: restore or rollback-to)
>>>>>>>>>>>>> 2. Recording versions at the day level, targeting batch queries
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Jingsong
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 20, 2023 at 2:55 PM Yun Tang <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Guys,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Since we use Paimon with Flink in most cases, I think we need
>>>>>>>>>>>>>> to disambiguate the same word "savepoint" in different systems.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For Flink, savepoint means:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Triggered by users, not periodically triggered by the system
>>>>>>>>>>>>>> itself. However, this FLIP wants to support creating it
>>>>>>>>>>>>>> periodically.
>>>>>>>>>>>>>> 2. Even the so-called incremental native savepoint [1] will not
>>>>>>>>>>>>>> depend on previous checkpoints or savepoints; it will still
>>>>>>>>>>>>>> copy files on DFS to the self-contained savepoint folder.
>>>>>>>>>>>>>> However, from this FLIP's description of the deletion of
>>>>>>>>>>>>>> expired snapshot files, a Paimon savepoint will refer to the
>>>>>>>>>>>>>> previously existing files directly.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't think we need to make the semantics in Paimon exactly
>>>>>>>>>>>>>> the same as Flink's. However, we need to introduce a table that
>>>>>>>>>>>>>> explains the differences compared with Flink, and discuss those
>>>>>>>>>>>>>> differences.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-203%3A+Incremental+savepoints#FLIP203:Incrementalsavepoints-Semantic
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best
>>>>>>>>>>>>>> Yun Tang
>>>>>>>>>>>>>> ________________________________
>>>>>>>>>>>>>> From: Nicholas Jiang <[email protected]>
>>>>>>>>>>>>>> Sent: Friday, May 19, 2023 17:40
>>>>>>>>>>>>>> To: [email protected] <[email protected]>
>>>>>>>>>>>>>> Subject: Re: [DISCUSS] PIP-4 Support savepoint
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Guys,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks Zelin for driving the savepoint proposal. I'd like to
>>>>>>>>>>>>>> offer some opinions on savepoints:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- About "introduce savepoint for Paimon to persist full data
>>>>>>>>>>>>>> at a point in time"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The motivation of the savepoint proposal reads more like
>>>>>>>>>>>>>> snapshot TTL management. Actually, disaster recovery is mission
>>>>>>>>>>>>>> critical for any software. Especially when it comes to data
>>>>>>>>>>>>>> systems, the impact can be very serious, leading to delayed or
>>>>>>>>>>>>>> even wrong business decisions at times. Savepoint is proposed
>>>>>>>>>>>>>> to assist users in recovering data from a previous state:
>>>>>>>>>>>>>> "savepoint" and "restore".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> "savepoint" saves the Paimon table as of the commit time;
>>>>>>>>>>>>>> therefore, if there is a savepoint, the data generated in the
>>>>>>>>>>>>>> corresponding commit cannot be cleaned. Meanwhile, a savepoint
>>>>>>>>>>>>>> lets users restore the table to this savepoint at a later point
>>>>>>>>>>>>>> in time if need be.
>>>>>>>>>>>>>> On similar lines, a savepoint cannot be triggered on a commit
>>>>>>>>>>>>>> that is already cleaned up. A savepoint is synonymous with
>>>>>>>>>>>>>> taking a backup, except that we don't make a new copy of the
>>>>>>>>>>>>>> table; we just save the state of the table elegantly so that we
>>>>>>>>>>>>>> can restore it later when needed.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> "restore" lets you restore your table to one of the savepoint
>>>>>>>>>>>>>> commits. Meanwhile, it cannot be undone (or reversed), so care
>>>>>>>>>>>>>> should be taken before doing a restore. At that time, Paimon
>>>>>>>>>>>>>> would delete all data files and commit files (timeline files)
>>>>>>>>>>>>>> greater than the savepoint commit to which the table is being
>>>>>>>>>>>>>> restored.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> BTW, it's better to introduce a snapshot view based on
>>>>>>>>>>>>>> savepoints, which could improve query performance of historical
>>>>>>>>>>>>>> data for Paimon tables.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- About the Public API of savepoint
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The savepoint interfaces currently introduced in the Public API
>>>>>>>>>>>>>> are not enough for users, for example, deleteSavepoint,
>>>>>>>>>>>>>> restoreSavepoint, etc.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- About "Paimon's savepoint needs to be combined with Flink's
>>>>>>>>>>>>>> savepoint":
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If Paimon supports a savepoint mechanism and provides savepoint
>>>>>>>>>>>>>> interfaces, the integration with Flink's savepoint is not
>>>>>>>>>>>>>> blocked by this proposal.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In summary, savepoint is not only used to improve the query
>>>>>>>>>>>>>> performance of historical data, but also for disaster recovery.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2023/05/17 09:53:11 Jingsong Li wrote:
>>>>>>>>>>>>>>> What Shammon mentioned is interesting. I agree with what he
>>>>>>>>>>>>>>> said about the differences in savepoints between databases and
>>>>>>>>>>>>>>> stream computing.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> About "Paimon's savepoint needs to be combined with Flink's
>>>>>>>>>>>>>>> savepoint":
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think it is possible, but we may need to handle this with
>>>>>>>>>>>>>>> another mechanism, because the snapshots after a savepoint may
>>>>>>>>>>>>>>> expire. We need to compare data between two savepoints to
>>>>>>>>>>>>>>> generate incremental data for streaming reads.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> But this does not need to block the FLIP; it looks like the
>>>>>>>>>>>>>>> current design does not prevent the future combination.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Jingsong
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, May 17, 2023 at 5:33 PM Shammon FY <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Caizhi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for your comments. As you mentioned, I think we may
>>>>>>>>>>>>>>>> need to discuss the role of savepoint in Paimon.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If I understand correctly, the main feature of savepoint in
>>>>>>>>>>>>>>>> the current PIP is that the savepoint will not expire, and
>>>>>>>>>>>>>>>> users can perform a time-travel query on the savepoint.
>>>>>>>>>>>>>>>> Besides that, there is savepoint in databases and in Flink:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. Savepoint in databases. A database can roll back table
>>>>>>>>>>>>>>>> data to a specified 'version' based on a savepoint. So the
>>>>>>>>>>>>>>>> key point of savepoint in databases is rolling back data.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2. Savepoint in Flink. Users can trigger a savepoint with a
>>>>>>>>>>>>>>>> specific 'path', which saves all state data of a job to the
>>>>>>>>>>>>>>>> savepoint.
>>>>>>>>>>>>>>>> Then users can create a new job based on the savepoint to
>>>>>>>>>>>>>>>> continue consuming incremental data. I think the core
>>>>>>>>>>>>>>>> capabilities are: backing up a job, and resuming a job based
>>>>>>>>>>>>>>>> on the savepoint.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In addition to the above, Paimon may also face data write
>>>>>>>>>>>>>>>> corruption and need to recover data based on a specified
>>>>>>>>>>>>>>>> savepoint. So we may need to consider: what abilities does a
>>>>>>>>>>>>>>>> Paimon savepoint need besides the ones mentioned in the
>>>>>>>>>>>>>>>> current PIP?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Additionally, as mentioned above, Flink also has a savepoint
>>>>>>>>>>>>>>>> mechanism. During the process of streaming data from Flink to
>>>>>>>>>>>>>>>> Paimon, does Paimon's savepoint need to be combined with
>>>>>>>>>>>>>>>> Flink's savepoint?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Shammon FY
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, May 17, 2023 at 4:02 PM Caizhi Weng <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi developers!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks Zelin for bringing up the discussion. The proposal
>>>>>>>>>>>>>>>>> seems good to me overall. However, I'd also like to bring up
>>>>>>>>>>>>>>>>> a few points.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1. As Jingsong mentioned, the Savepoint class should not
>>>>>>>>>>>>>>>>> become a public API, at least for now. What we need to
>>>>>>>>>>>>>>>>> discuss for the public API is how users can create or delete
>>>>>>>>>>>>>>>>> savepoints. For example, what the table option looks like,
>>>>>>>>>>>>>>>>> what commands and options are provided for the Flink action,
>>>>>>>>>>>>>>>>> etc.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 2.
>>>>>>>>>>>>>>>>> Currently most Flink actions are related to stream
>>>>>>>>>>>>>>>>> processing, so only Flink can support them. However,
>>>>>>>>>>>>>>>>> savepoint creation and deletion seems like a feature for
>>>>>>>>>>>>>>>>> batch processing. So aside from Flink actions, shall we also
>>>>>>>>>>>>>>>>> provide something like Spark actions for savepoints?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I would also like to comment on Shammon's views.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Should we introduce an option for the savepoint path which
>>>>>>>>>>>>>>>>>> may be different from 'warehouse'? Then users can back up
>>>>>>>>>>>>>>>>>> the data of the savepoint.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I don't see this as necessary. To back up a table, the user
>>>>>>>>>>>>>>>>> just needs to copy all files from the table directory.
>>>>>>>>>>>>>>>>> Savepoint in Paimon, as far as I understand, is mainly for
>>>>>>>>>>>>>>>>> users to review historical data, not for backing up tables.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Will the savepoint copy data files from the snapshot or
>>>>>>>>>>>>>>>>>> only save meta files?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It would be a heavy burden if a savepoint copied all its
>>>>>>>>>>>>>>>>> files. As I mentioned above, savepoint is not for backing up
>>>>>>>>>>>>>>>>> tables.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> How can users create a new table and restore data from a
>>>>>>>>>>>>>>>>>> specified savepoint?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This reminds me of savepoints in Flink. Still, savepoint is
>>>>>>>>>>>>>>>>> not for backing up tables, so I guess we don't need to
>>>>>>>>>>>>>>>>> support "restoring data" from a savepoint.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, May 17, 2023 at 10:32, Shammon FY <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks Zelin for initiating this discussion.
>>>>>>>>>>>>>>>>>> I have some comments:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 1. Should we introduce an option for the savepoint path
>>>>>>>>>>>>>>>>>> which may be different from 'warehouse'? Then users can
>>>>>>>>>>>>>>>>>> back up the data of the savepoint.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 2. Will the savepoint copy data files from the snapshot or
>>>>>>>>>>>>>>>>>> only save meta files? The description in the PIP, "After we
>>>>>>>>>>>>>>>>>> introduce savepoint, we should also check if the data files
>>>>>>>>>>>>>>>>>> are used by savepoints.", suggests we only save meta files
>>>>>>>>>>>>>>>>>> for a savepoint.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 3. How can users create a new table and restore data from a
>>>>>>>>>>>>>>>>>> specified savepoint?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>> Shammon FY
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, May 17, 2023 at 10:19 AM Jingsong Li <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks Zelin for driving.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Some comments:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1. I think we could move `Proposed Changes` to the top;
>>>>>>>>>>>>>>>>>>> the Public API has no meaning if I don't know how it works.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 2. Public API: Savepoint and SavepointManager are not
>>>>>>>>>>>>>>>>>>> public API; only the Flink action or configuration options
>>>>>>>>>>>>>>>>>>> should be public API.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 3. Maybe we can have a separate chapter to describe
>>>>>>>>>>>>>>>>>>> `savepoint.create-interval`, maybe 'Periodic savepoint'?
>>>>>>>>>>>>>>>>>>> It is not just an interval, because the real use case is a
>>>>>>>>>>>>>>>>>>> savepoint after 0:00.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 4. About 'Interaction with Snapshot', to be continued ...
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>> Jingsong
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Tue, May 16, 2023 at 7:07 PM yu zelin <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi, Paimon Devs,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I'd like to start a discussion about PIP-4 [1]. In this
>>>>>>>>>>>>>>>>>>>> PIP, I want to talk about why we need savepoint, and
>>>>>>>>>>>>>>>>>>>> share some thoughts about managing and using savepoints.
>>>>>>>>>>>>>>>>>>>> Looking forward to your questions and suggestions.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>> Yu Zelin
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> [1] https://cwiki.apache.org/confluence/x/NxE0Dw
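[Editor's appendix] The version-resolution rule settled at the top of this thread (a pure numeric `VERSION AS OF` value refers to a snapshot id, any other string refers to a tag, and pure numeric tag names are rejected at creation time) could be sketched roughly as below. `VersionResolver` and all of its members are hypothetical illustrations, not Paimon's actual API:

```java
// Hypothetical sketch of the resolution rule discussed in the thread:
// numeric version value -> snapshot id, otherwise -> tag name, and pure
// numeric tag names are rejected so the two namespaces stay disjoint.
public final class VersionResolver {

    public enum Kind { SNAPSHOT, TAG }

    public static final class Resolved {
        public final Kind kind;
        public final long snapshotId;  // valid only when kind == SNAPSHOT
        public final String tagName;   // valid only when kind == TAG

        Resolved(Kind kind, long snapshotId, String tagName) {
            this.kind = kind;
            this.snapshotId = snapshotId;
            this.tagName = tagName;
        }
    }

    // A string is "numeric" if it is non-empty and all characters are digits.
    private static boolean isNumeric(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isDigit(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // VERSION AS OF <value>: numeric means snapshot, otherwise tag.
    public static Resolved resolve(String versionValue) {
        if (isNumeric(versionValue)) {
            return new Resolved(Kind.SNAPSHOT, Long.parseLong(versionValue), null);
        }
        return new Resolved(Kind.TAG, -1L, versionValue);
    }

    // Reject pure numeric tag names so they can never shadow a snapshot id.
    public static void validateTagName(String tagName) {
        if (isNumeric(tagName)) {
            throw new IllegalArgumentException(
                    "Tag name cannot be a pure numeric string: " + tagName);
        }
    }
}
```

With this split, `SELECT * FROM t VERSION AS OF 1` goes to snapshot #1 and `VERSION AS OF 'last_year'` goes to the tag `last_year`, matching the Iceberg-style behavior referenced above.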

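[Editor's appendix] The "Data Files Handling" rule discussed in the thread (a data file may be physically deleted only when no remaining snapshot and no remaining savepoint/tag still references it) could be sketched as below. `DataFileReaper` and its method names are hypothetical, not Paimon's actual code; file references are modeled as plain path strings for illustration:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: when expiring a snapshot (or deleting a tag), only
// files referenced by no remaining snapshot and no remaining tag are safe
// to delete.
public final class DataFileReaper {

    // Collect every file path referenced by any of the given holders.
    private static Set<String> referencedFiles(List<List<String>> holders) {
        Set<String> live = new HashSet<>();
        for (List<String> files : holders) {
            live.addAll(files);
        }
        return live;
    }

    // Returns the candidates that are referenced neither by remaining
    // snapshots nor by remaining tags, i.e. the files safe to delete.
    public static Set<String> deletableFiles(
            Set<String> candidates,
            List<List<String>> remainingSnapshots,
            List<List<String>> remainingTags) {
        Set<String> deletable = new HashSet<>(candidates);
        deletable.removeAll(referencedFiles(remainingSnapshots));
        deletable.removeAll(referencedFiles(remainingTags));
        return deletable;
    }
}
```

The same check runs symmetrically in both directions described in the thread: snapshot expiration must respect files held by tags, and tag deletion must respect files held by live snapshots and other tags.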