I think we can just throw exceptions for pure numeric tag names.

Iceberg's behavior looks confusing.
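A minimal sketch of the resolution rule being discussed, assuming the Iceberg-style convention from the thread (function names are illustrative, not Paimon's actual API):

```python
def resolve_version(value: str) -> tuple[str, str]:
    """Resolve a `VERSION AS OF` value: a pure numeric string is treated
    as a snapshot id, anything else as a tag name."""
    if value.isdigit():
        return ("snapshot", value)
    return ("tag", value)


def validate_tag_name(name: str) -> None:
    """Reject pure numeric tag names at creation time, so a tag can
    never be ambiguous with a snapshot id."""
    if name.isdigit():
        raise ValueError(
            f"Tag name '{name}' is numeric and would be ambiguous "
            "with a snapshot id; choose a non-numeric name."
        )
```

With this rule, `VERSION AS OF 1` always means snapshot #1, and `VERSION AS OF 'last_year'` always means the tag `last_year`.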

Best,
Jingsong

On Tue, May 30, 2023 at 3:40 PM yu zelin <[email protected]> wrote:
>
> Hi, Shammon,
>
> An intuitive way is to use a numeric string to indicate a snapshot and a
> non-numeric string to indicate a tag.
> For example:
>
> SELECT * FROM t VERSION AS OF 1            -- snapshot #1
> SELECT * FROM t VERSION AS OF 'last_year'  -- tag `last_year`
>
> This is also what Iceberg does [1].
>
> However, with this approach, a tag name cannot be a pure numeric string. I
> think this is acceptable and I will add this to the document.
>
> Best,
> Yu Zelin
>
> [1] https://iceberg.apache.org/docs/latest/spark-queries/#sql
>
> > On May 30, 2023, at 12:17, Shammon FY <[email protected]> wrote:
> >
> > Hi zelin,
> >
> > Thanks for your update. I have one comment about Time Travel on savepoint.
> >
> > Currently we can use this statement in Spark to query snapshot 1:
> > SELECT * FROM t VERSION AS OF 1;
> >
> > My point is: how can we distinguish between a snapshot and a savepoint when
> > users submit a statement as follows:
> > SELECT * FROM t VERSION AS OF <version value>;
> >
> > Best,
> > Shammon FY
> >
> > On Tue, May 30, 2023 at 11:37 AM yu zelin <[email protected]> wrote:
> >
> >> Hi, Jingsong,
> >>
> >> Thanks for your feedback.
> >>
> >> ## TAG ID
> >> It seems the id is useless currently. I’ll remove it.
> >>
> >> ## Time Travel Syntax
> >> Since tag id is removed, we can just use:
> >>
> >> SELECT * FROM t VERSION AS OF 'tag-name'
> >>
> >> to travel to a tag.
> >>
> >> ## Tag class
> >> I agree with you that we can reuse the Snapshot class. We can introduce
> >> `TagManager`
> >> only to manage tags.
> >>
> >> ## Expiring Snapshot
> >>> why not record it in ManifestEntry?
> >> This is because every time Paimon generates a snapshot, it creates new
> >> ManifestEntries for the data files. Consider this scenario: if we record it
> >> in ManifestEntry and we commit data file A to snapshot #1, we get manifest
> >> entry Entry#1 as [ADD, A, committed at #1]. Then we commit -A to snapshot #2,
> >> and we get manifest entry Entry#2 as [DELETE, A, ?]. As you can see, we
> >> cannot know at which snapshot we committed file A. So we have to record this
> >> information in the data file meta directly.
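The file-lifetime reasoning above can be sketched as follows (an illustration with hypothetical classes, not Paimon's actual code): with the creation snapshot recorded on the data file meta, a file is still needed only while some retained snapshot or tag falls inside its live range.

```python
from dataclasses import dataclass


@dataclass
class DataFileMeta:
    name: str
    creation_snapshot: int  # recorded once, at the ADD commit


def used_by_any(retained_snapshots, creation_snapshot, deletion_snapshot):
    """A data file added at `creation_snapshot` and logically deleted at
    `deletion_snapshot` is still needed if any retained snapshot or tag
    lies in [creation_snapshot, deletion_snapshot)."""
    return any(creation_snapshot <= s < deletion_snapshot
               for s in retained_snapshots)


# File A is committed at snapshot #1 and deleted (-A) at snapshot #2.
a = DataFileMeta("A", creation_snapshot=1)
# A tag on snapshot #1 still references A; a tag on snapshot #3 does not.
```

The DELETE entry alone cannot answer this question, which is why the creation snapshot has to live on the data file meta.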
> >>
> >>> We should note that "record it in `DataFileMeta`" should be done before
> >>> "tag" and document version compatibility.
> >>
> >> I will add a note for this.
> >>
> >> Best,
> >> Yu Zelin
> >>
> >>
> >>> On May 29, 2023, at 10:29, Jingsong Li <[email protected]> wrote:
> >>>
> >>> Thanks Zelin for the update.
> >>>
> >>> ## TAG ID
> >>>
> >>> Is this useful? We have tag-name, snapshot-id, and now we are introducing
> >>> a tag id? Which one is actually used?
> >>>
> >>> ## Time Travel
> >>>
> >>> SELECT * FROM t VERSION AS OF tag-name.<name>
> >>>
> >>> This does not look like the SQL standard.
> >>>
> >>> Why do we introduce this `tag-name` prefix?
> >>>
> >>> ## Tag class
> >>>
> >>> Why not just use the Snapshot class? It looks like we don't need to
> >>> introduce a Tag class. We can just copy the snapshot file to tag/.
> >>>
> >>> ## Expiring Snapshot
> >>>
> >>> We should note that "record it in `DataFileMeta`" should be done
> >>> before "tag". And document version compatibility.
> >>> And why not record it in ManifestEntry?
> >>>
> >>> Best,
> >>> Jingsong
> >>>
> >>> On Fri, May 26, 2023 at 11:15 AM yu zelin <[email protected]> wrote:
> >>>>
> >>>> Hi, all,
> >>>>
> >>>> FYI, I have updated the PIP [1].
> >>>>
> >>>> Main changes:
> >>>> - Use new name `tag`
> >>>> - Enrich Motivation
> >>>> - New section `Data Files Handling` to describe how to determine whether
> >>>> a data file can be deleted.
> >>>>
> >>>> Best,
> >>>> Yu Zelin
> >>>>
> >>>> [1] https://cwiki.apache.org/confluence/x/NxE0Dw
> >>>>
> >>>>> On May 24, 2023, at 17:18, yu zelin <[email protected]> wrote:
> >>>>>
> >>>>> Hi, Guojun,
> >>>>>
> >>>>> I’d like to share my thoughts about your questions.
> >>>>>
> >>>>> 1. Expiration of savepoint
> >>>>> In my opinion, savepoints are created at long intervals, so there will
> >>>>> not be too many of them. If users create one savepoint per day, there are
> >>>>> 365 savepoints a year. So I didn't consider expiration, and I think
> >>>>> providing a Flink action like `delete-savepoint id = 1` is enough for now.
> >>>>> But if it is really important, we can introduce table options to do so,
> >>>>> similar to expiring snapshots.
> >>>>>
> >>>>> 2. >   id of compacted snapshot picked by the savepoint
> >>>>> My initial idea was to pick a compacted snapshot or do compaction before
> >>>>> creating the savepoint. But after discussing with Jingsong, I found it is
> >>>>> difficult. So now I propose to directly create the savepoint from the
> >>>>> given snapshot. Maybe we can optimize it later.
> >>>>> The changes will be updated soon.
> >>>>>> manifest file list in system-table
> >>>>> I think the manifest file is not very important for users. Users can find
> >>>>> when a savepoint was created, get the savepoint id, and then query the
> >>>>> savepoint by that id. I didn't see in what scenario users would need the
> >>>>> manifest file information. What do you think?
> >>>>>
> >>>>> Best,
> >>>>> Yu Zelin
> >>>>>
> >>>>>> On May 24, 2023, at 10:50, Guojun Li <[email protected]> wrote:
> >>>>>>
> >>>>>> Thanks zelin for bringing up the discussion. I'm thinking about:
> >>>>>> 1. How do we manage the savepoints if there is no expiration mechanism,
> >>>>>> by the TTL management of storage or an external script?
> >>>>>> 2. I think the id of the compacted snapshot picked by the savepoint and
> >>>>>> the manifest file list are also important information for users; could
> >>>>>> this information be stored in the system-table?
> >>>>>>
> >>>>>> Best,
> >>>>>> Guojun
> >>>>>>
> >>>>>> On Mon, May 22, 2023 at 9:13 PM Jingsong Li <[email protected]>
> >> wrote:
> >>>>>>
> >>>>>>> FYI
> >>>>>>>
> >>>>>>> The PIP lacks a table to show Discussion thread & Vote thread &
> >> ISSUE...
> >>>>>>>
> >>>>>>> Best
> >>>>>>> Jingsong
> >>>>>>>
> >>>>>>> On Mon, May 22, 2023 at 4:48 PM yu zelin <[email protected]>
> >> wrote:
> >>>>>>>>
> >>>>>>>> Hi, all,
> >>>>>>>>
> >>>>>>>> Thank you all for your suggestions and questions. After reading your
> >>>>>>>> suggestions, I have adopted some of them, and I want to share my
> >>>>>>>> opinions here.
> >>>>>>>>
> >>>>>>>> To make my statements clearer, I will still use the word `savepoint`.
> >>>>>>>> When we reach a consensus, the name may be changed.
> >>>>>>>>
> >>>>>>>> 1. The purposes of savepoint
> >>>>>>>>
> >>>>>>>> As Shammon mentioned, Flink and database also have the concept of
> >>>>>>> `savepoint`. So it’s better to clarify the purposes of our savepoint.
> >>>>>>> Thanks for Nicholas and Jingsong, I think your explanations are very
> >> clear.
> >>>>>>> I’d like to give my summary:
> >>>>>>>>
> >>>>>>>> (1) Fault recovery (or we can say disaster recovery). Users can ROLL
> >>>>>>>> BACK to a savepoint if needed. If a user rolls back to a savepoint, the
> >>>>>>>> table will hold the data in the savepoint, and the data committed after
> >>>>>>>> the savepoint will be deleted. In this scenario we need savepoints
> >>>>>>>> because snapshots may have expired; a savepoint lives longer and
> >>>>>>>> preserves the user's old data.
> >>>>>>>>
> >>>>>>>> (2) Record versions of data at a longer interval (typically daily or
> >>>>>>>> weekly level). With savepoints, users can query the old data in batch
> >>>>>>>> mode. Compared to copying records to a new table or merging incremental
> >>>>>>>> records with old records (like using MERGE INTO in Hive), a savepoint
> >>>>>>>> is more lightweight because we don't copy data files; we just record
> >>>>>>>> their metadata.
> >>>>>>>>
> >>>>>>>> As you can see, savepoint is very similar to snapshot. The
> >> differences
> >>>>>>> are:
> >>>>>>>>
> >>>>>>>> (1) A savepoint lives longer. In most cases, a snapshot's lifetime is
> >>>>>>>> about several minutes to hours. We expect a savepoint to live for
> >>>>>>>> several days, weeks, or even months.
> >>>>>>>>
> >>>>>>>> (2) A savepoint is mainly used for batch reading of historical data.
> >>>>>>>> In this PIP, we don't introduce streaming reading for savepoints.
> >>>>>>>>
> >>>>>>>> 2. Candidates of name
> >>>>>>>>
> >>>>>>>> I agree with Jingsong that we can use a new name. Since the purpose
> >>>>>>>> and mechanism of savepoint (it is very similar to snapshot) are
> >>>>>>>> similar to `tag` in Iceberg, maybe we can use `tag`.
> >>>>>>>>
> >>>>>>>> In my opinion, an alternative is `anchor`. All the snapshots are like
> >>>>>>>> the navigation path of the streaming data, and an `anchor` can fix it
> >>>>>>>> in place.
> >>>>>>>>
> >>>>>>>> 3. Public table operations and options
> >>>>>>>>
> >>>>>>>> We propose to expose some operations and table options for users to
> >>>>>>>> manage savepoints.
> >>>>>>>>
> >>>>>>>> (1) Operations (currently for Flink)
> >>>>>>>> We provide Flink actions to manage savepoints:
> >>>>>>>> create-savepoint: Generate a savepoint from the latest snapshot.
> >>>>>>>> Creating from a specified snapshot is also supported.
> >>>>>>>> delete-savepoint: Delete the specified savepoint.
> >>>>>>>> rollback-to: Roll back to a specified savepoint.
> >>>>>>>>
> >>>>>>>> (2) Table options
> >>>>>>>> We propose to provide options for creating savepoints periodically:
> >>>>>>>> savepoint.create-time: When to create the savepoint. Example: 00:00
> >>>>>>>> savepoint.create-interval: Interval between the creation of two
> >>>>>>>> savepoints. Example: 2 d.
> >>>>>>>> savepoint.time-retained: The maximum time to retain savepoints.
> >>>>>>>>
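The periodic-creation options above could behave roughly like this (a sketch under assumed semantics; the option names come from the proposal, while the scheduling logic itself is illustrative):

```python
from datetime import datetime, timedelta


def next_savepoint_time(last: datetime, create_time: str,
                        interval_days: int) -> datetime:
    """Compute when the next periodic savepoint fires, assuming
    `savepoint.create-time` is a wall-clock time like "00:00" and
    `savepoint.create-interval` is a day count like `2 d`."""
    hour, minute = map(int, create_time.split(":"))
    base = last + timedelta(days=interval_days)
    # Snap to the configured wall-clock time on the target day.
    return base.replace(hour=hour, minute=minute, second=0, microsecond=0)
```

This captures the point made later in the thread that the real use case is "a savepoint after 0:00", not just a fixed interval.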
> >>>>>>>> (3) Procedures (future work)
> >>>>>>>> Spark supports SQL extensions. After we support the Spark CALL
> >>>>>>>> statement, we can provide procedures for Spark users to create,
> >>>>>>>> delete, or roll back to a savepoint.
> >>>>>>>>
> >>>>>>>> Support for CALL is on the roadmap of Flink. In a future version, we
> >>>>>>>> can also support savepoint-related procedures for Flink users.
> >>>>>>>>
> >>>>>>>> 4. Expiration of data files
> >>>>>>>>
> >>>>>>>> Currently, when a snapshot expires, data files that are not used by
> >>>>>>>> other snapshots are deleted. After we introduce savepoints, we must
> >>>>>>>> make sure the data files referenced by savepoints will not be deleted.
> >>>>>>>>
> >>>>>>>> Conversely, when a savepoint is deleted, the data files that are not
> >>>>>>>> used by existing snapshots and other savepoints will be deleted.
> >>>>>>>>
> >>>>>>>> I have written some POC code to implement this. I will update the
> >>>>>>>> mechanism in the PIP soon.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Yu Zelin
> >>>>>>>>
> >>>>>>>>> On May 21, 2023, at 20:54, Jingsong Li <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>> Thanks Yun for your information.
> >>>>>>>>>
> >>>>>>>>> We need to be careful to avoid confusion between the Paimon and
> >>>>>>>>> Flink concepts of "savepoint".
> >>>>>>>>>
> >>>>>>>>> Maybe we don't have to insist on using "savepoint"; for example,
> >>>>>>>>> TAG is also a candidate, just like in Iceberg [1].
> >>>>>>>>>
> >>>>>>>>> [1] https://iceberg.apache.org/docs/latest/branching/
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Jingsong
> >>>>>>>>>
> >>>>>>>>> On Sun, May 21, 2023 at 8:51 PM Jingsong Li <
> >> [email protected]>
> >>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Thanks Nicholas for your detailed requirements.
> >>>>>>>>>>
> >>>>>>>>>> We need to supplement user requirements in the FLIP, which is
> >>>>>>>>>> mainly aimed at two purposes:
> >>>>>>>>>> 1. Fault recovery for data errors (named: restore or rollback-to)
> >>>>>>>>>> 2. Recording versions at the day level, targeting batch queries
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Jingsong
> >>>>>>>>>>
> >>>>>>>>>> On Sat, May 20, 2023 at 2:55 PM Yun Tang <[email protected]>
> >> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Guys,
> >>>>>>>>>>>
> >>>>>>>>>>> Since we use Paimon with Flink in most cases, I think we need to
> >>>>>>>>>>> distinguish the same word "savepoint" across different systems.
> >>>>>>>>>>>
> >>>>>>>>>>> For Flink, savepoint means:
> >>>>>>>>>>>
> >>>>>>>>>>> 1.  Triggered by users, not periodically triggered by the system
> >>>>>>>>>>> itself. However, this FLIP wants to support creating it
> >>>>>>>>>>> periodically.
> >>>>>>>>>>> 2.  Even the so-called incremental native savepoint [1] will not
> >>>>>>>>>>> depend on previous checkpoints or savepoints; it will still copy
> >>>>>>>>>>> files on DFS to the self-contained savepoint folder. However, from
> >>>>>>>>>>> the description in this FLIP about the deletion of expired snapshot
> >>>>>>>>>>> files, a Paimon savepoint will refer to the previously existing
> >>>>>>>>>>> files directly.
> >>>>>>>>>>>
> >>>>>>>>>>> I don't think we need to make the semantics of Paimon totally the
> >>>>>>>>>>> same as Flink's. However, we need to introduce a table telling the
> >>>>>>>>>>> difference compared with Flink and discuss the differences.
> >>>>>>>>>>>
> >>>>>>>>>>> [1]
> >>>>>>>
> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-203%3A+Incremental+savepoints#FLIP203:Incrementalsavepoints-Semantic
> >>>>>>>>>>>
> >>>>>>>>>>> Best
> >>>>>>>>>>> Yun Tang
> >>>>>>>>>>> ________________________________
> >>>>>>>>>>> From: Nicholas Jiang <[email protected]>
> >>>>>>>>>>> Sent: Friday, May 19, 2023 17:40
> >>>>>>>>>>> To: [email protected] <[email protected]>
> >>>>>>>>>>> Subject: Re: [DISCUSS] PIP-4 Support savepoint
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Guys,
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks Zelin for driving the savepoint proposal. I propose some
> >>>>>>>>>>> opinions on savepoint:
> >>>>>>>>>>>
> >>>>>>>>>>> -- About "introduce savepoint for Paimon to persist full data in
> >> a
> >>>>>>> time point"
> >>>>>>>>>>>
> >>>>>>>>>>> The motivation of the savepoint proposal is more like snapshot TTL
> >>>>>>>>>>> management. Actually, disaster recovery is mission critical for any
> >>>>>>>>>>> software. Especially when it comes to data systems, the impact
> >>>>>>>>>>> could be very serious, leading to delayed or even wrong business
> >>>>>>>>>>> decisions at times. Savepoint is proposed to assist users in
> >>>>>>>>>>> recovering data from a previous state: "savepoint" and "restore".
> >>>>>>>>>>>
> >>>>>>>>>>> "savepoint" saves the Paimon table as of the commit time,
> >> therefore
> >>>>>>> if there is a savepoint, the data generated in the corresponding
> >> commit
> >>>>>>> could not be clean. Meanwhile, savepoint could let user restore the
> >> table
> >>>>>>> to this savepoint at a later point in time if need be. On similar
> >> lines,
> >>>>>>> savepoint cannot be triggered on a commit that is already cleaned up.
> >>>>>>> Savepoint is synonymous to taking a backup, just that we don't make
> >> a new
> >>>>>>> copy of the table, but just save the state of the table elegantly so
> >> that
> >>>>>>> we can restore it later when in need.
> >>>>>>>>>>>
> >>>>>>>>>>> "restore" lets you restore your table to one of the savepoint
> >>>>>>> commit. Meanwhile, it cannot be undone (or reversed) and so care
> >> should be
> >>>>>>> taken before doing a restore. At this time, Paimon would delete all
> >> data
> >>>>>>> files and commit files (timeline files) greater than the savepoint
> >> commit
> >>>>>>> to which the table is being restored.
> >>>>>>>>>>>
> >>>>>>>>>>> BTW, it's better to introduce a snapshot view based on savepoint,
> >>>>>>>>>>> which could improve the query performance of historical data for
> >>>>>>>>>>> Paimon tables.
> >>>>>>>>>>>
> >>>>>>>>>>> -- About the Public API of savepoint
> >>>>>>>>>>>
> >>>>>>>>>>> The currently introduced savepoint interfaces in the Public API
> >>>>>>>>>>> are not enough for users; for example, deleteSavepoint,
> >>>>>>>>>>> restoreSavepoint, etc. are missing.
> >>>>>>>>>>>
> >>>>>>>>>>> -- About "Paimon's savepoint need to be combined with Flink's
> >>>>>>> savepoint":
> >>>>>>>>>>>
> >>>>>>>>>>> If Paimon supports a savepoint mechanism and provides savepoint
> >>>>>>>>>>> interfaces, the integration with Flink's savepoint is not blocked
> >>>>>>>>>>> by this proposal.
> >>>>>>>>>>>
> >>>>>>>>>>> In summary, savepoint is not only used to improve the query
> >>>>>>> performance of historical data, but also used for disaster recovery
> >>>>>>> processing.
> >>>>>>>>>>>
> >>>>>>>>>>> On 2023/05/17 09:53:11 Jingsong Li wrote:
> >>>>>>>>>>>> What Shammon mentioned is interesting. I agree with what he said
> >>>>>>> about
> >>>>>>>>>>>> the differences in savepoints between databases and stream
> >>>>>>> computing.
> >>>>>>>>>>>>
> >>>>>>>>>>>> About "Paimon's savepoint need to be combined with Flink's
> >>>>>>> savepoint":
> >>>>>>>>>>>>
> >>>>>>>>>>>> I think it is possible, but we may need to deal with this in
> >> another
> >>>>>>>>>>>> mechanism, because the snapshots after savepoint may expire. We
> >> need
> >>>>>>>>>>>> to compare data between two savepoints to generate incremental
> >> data
> >>>>>>>>>>>> for streaming read.
> >>>>>>>>>>>>
> >>>>>>>>>>>> But this may not need to block the FLIP; it looks like the
> >>>>>>>>>>>> current design does not break the future combination?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Jingsong
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, May 17, 2023 at 5:33 PM Shammon FY <[email protected]>
> >>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Caizhi,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for your comments. As you mentioned, I think we may
> >> need to
> >>>>>>> discuss
> >>>>>>>>>>>>> the role of savepoint in Paimon.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> If I understand correctly, the main feature of savepoint in the
> >>>>>>> current PIP
> >>>>>>>>>>>>> is that the savepoint will not be expired, and users can
> >> perform a
> >>>>>>> query on
> >>>>>>>>>>>>> the savepoint according to time-travel. Besides that, there is
> >>>>>>> savepoint in
> >>>>>>>>>>>>> the database and Flink.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1. Savepoint in databases. A database can roll back table data
> >>>>>>>>>>>>> to a specified 'version' based on a savepoint. So the key point
> >>>>>>>>>>>>> of savepoint in a database is to roll back data.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 2. Savepoint in Flink. Users can trigger a savepoint with a
> >>>>>>>>>>>>> specific 'path' and save all state data of the job to the
> >>>>>>>>>>>>> savepoint. Then users can create a new job based on the savepoint
> >>>>>>>>>>>>> to continue consuming incremental data. I think the core
> >>>>>>>>>>>>> capabilities are: backing up a job, and resuming a job based on
> >>>>>>>>>>>>> the savepoint.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> In addition to the above, Paimon may also face data write
> >>>>>>>>>>>>> corruption and need to recover data based on a specified
> >>>>>>>>>>>>> savepoint. So we may need to consider what abilities Paimon's
> >>>>>>>>>>>>> savepoint needs besides the ones mentioned in the current PIP?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Additionally, as mentioned above, Flink also has
> >>>>>>>>>>>>> savepoint mechanism. During the process of streaming data from
> >>>>>>> Flink to
> >>>>>>>>>>>>> Paimon, does Paimon's savepoint need to be combined with
> >> Flink's
> >>>>>>> savepoint?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best,
> >>>>>>>>>>>>> Shammon FY
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, May 17, 2023 at 4:02 PM Caizhi Weng <
> >> [email protected]>
> >>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi developers!
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks Zelin for bringing up the discussion. The proposal
> >> seems
> >>>>>>> good to me
> >>>>>>>>>>>>>> overall. However, I'd also like to bring up a few points.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 1. As Jingsong mentioned, Savepoint class should not become a
> >>>>>>> public API,
> >>>>>>>>>>>>>> at least for now. What we need to discuss for the public API
> >> is
> >>>>>>> how the
> >>>>>>>>>>>>>> users can create or delete savepoints. For example, what the
> >>>>>>> table option
> >>>>>>>>>>>>>> looks like, what commands and options are provided for the
> >> Flink
> >>>>>>> action,
> >>>>>>>>>>>>>> etc.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 2. Currently most Flink actions are related to streaming
> >>>>>>> processing, so
> >>>>>>>>>>>>>> only Flink can support them. However, savepoint creation and
> >>>>>>> deletion seem
> >>>>>>>>>>>>>> like a feature for batch processing. So aside from Flink
> >> actions,
> >>>>>>> shall we
> >>>>>>>>>>>>>> also provide something like Spark actions for savepoints?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I would also like to comment on Shammon's views.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Should we introduce an option for the savepoint path, which may
> >>>>>>>>>>>>>> be different from 'warehouse'? Then users can back up the data
> >>>>>>>>>>>>>> of savepoints.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I don't see that this is necessary. To back up a table, the
> >>>>>>>>>>>>>> user just needs to copy
> >>>>>>>>>>>>>> all files from the table directory. Savepoint in Paimon, as
> >> far
> >>>>>>> as I
> >>>>>>>>>>>>>> understand, is mainly for users to review historical data, not
> >>>>>>> for backing
> >>>>>>>>>>>>>> up tables.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Will the savepoint copy data files from snapshot or only save
> >>>>>>> meta files?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> It would be a heavy burden if a savepoint copies all its
> >> files.
> >>>>>>> As I
> >>>>>>>>>>>>>> mentioned above, savepoint is not for backing up tables.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> How can users create a new table and restore data from the
> >>>>>>> specified
> >>>>>>>>>>>>>>> savepoint?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This reminds me of savepoints in Flink. Still, savepoint is
> >> not
> >>>>>>> for backing
> >>>>>>>>>>>>>> up tables so I guess we don't need to support "restoring data"
> >>>>>>> from a
> >>>>>>>>>>>>>> savepoint.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, May 17, 2023 at 10:32, Shammon FY <[email protected]> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks Zelin for initiating this discussion. I have some
> >>>>>>> comments:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 1. Should we introduce an option for the savepoint path, which
> >>>>>>>>>>>>>>> may be different from 'warehouse'? Then users can back up the
> >>>>>>>>>>>>>>> data of savepoints.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 2. Will the savepoint copy data files from snapshot or only
> >> save
> >>>>>>> meta
> >>>>>>>>>>>>>>> files? The description in the PIP "After we introduce
> >> savepoint,
> >>>>>>> we
> >>>>>>>>>>>>>> should
> >>>>>>>>>>>>>>> also check if the data files are used by savepoints." looks
> >> like
> >>>>>>> we only
> >>>>>>>>>>>>>>> save meta files for savepoint.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 3. How can users create a new table and restore data from the
> >>>>>>> specified
> >>>>>>>>>>>>>>> savepoint?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>> Shammon FY
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Wed, May 17, 2023 at 10:19 AM Jingsong Li <
> >>>>>>> [email protected]>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks Zelin for driving.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Some comments:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 1. I think we could move `Proposed Changes` to the top; the
> >>>>>>>>>>>>>>>> Public API has no meaning if I don't know how it works.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 2. Public API, Savepoint and SavepointManager are not Public
> >>>>>>> API, only
> >>>>>>>>>>>>>>>> Flink action or configuration option should be public API.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 3. Maybe we can have a separate chapter to describe
> >>>>>>>>>>>>>>>> `savepoint.create-interval`, maybe 'Periodic savepoint'? It is
> >>>>>>>>>>>>>>>> not just an interval, because the real use case is a savepoint
> >>>>>>>>>>>>>>>> after 0:00.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 4. About 'Interaction with Snapshot', to be continued ...
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>> Jingsong
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Tue, May 16, 2023 at 7:07 PM yu zelin <
> >> [email protected]
> >>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi, Paimon Devs,
> >>>>>>>>>>>>>>>>> I'd like to start a discussion about PIP-4 [1]. In this PIP,
> >>>>>>>>>>>>>>>>> I want to talk about why we need savepoint, and some thoughts
> >>>>>>>>>>>>>>>>> about managing and using savepoint. Looking forward to your
> >>>>>>>>>>>>>>>>> questions and suggestions.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>> Yu Zelin
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> [1] https://cwiki.apache.org/confluence/x/NxE0Dw
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>
> >>
>
