Re: [DISCUSS] PIP-4 Support savepoint

Nicholas Jiang Fri, 19 May 2023 02:40:12 -0700

Hi Guys,

Thanks Zelin for driving the savepoint proposal. Here are some thoughts on
savepoint:


-- About "introduce savepoint for Paimon to persist full data in a time point"

The motivation in the savepoint proposal reads more like snapshot TTL
management. However, disaster recovery is mission critical for any software.
This is especially true for data systems, where the impact can be serious,
leading to delayed or even wrong business decisions. Savepoint is proposed to
help users recover data from a previous state through two operations:
"savepoint" and "restore".

"savepoint" saves the Paimon table as of the commit time, so as long as a
savepoint exists, the data generated in the corresponding commit cannot be
cleaned up. A savepoint also lets users restore the table to that point at a
later time if need be. Similarly, a savepoint cannot be triggered on a commit
that has already been cleaned up. A savepoint is analogous to taking a backup,
except that we don't make a new copy of the table; we only save the state of
the table so that we can restore it later when needed.

"restore" lets you restore your table to one of the savepoint commits. It
cannot be undone (or reversed), so care should be taken before doing a
restore. On restore, Paimon would delete all data files and commit files
(timeline files) later than the savepoint commit to which the table is being
restored.
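
To make the two operations concrete, here is a minimal Java sketch of the
semantics described above. It is only a toy in-memory model under my own
assumptions, not Paimon's implementation: the class TableTimeline and its
methods (commit, createSavepoint, expireBefore, restore) are hypothetical
names made up for illustration, and snapshot expiration is simplified to
dropping every snapshot id below a threshold.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy model of a table timeline. Hypothetical; not a Paimon class. */
class TableTimeline {
    // snapshot id -> data files written by that commit
    private final TreeMap<Long, List<String>> snapshots = new TreeMap<>();
    // ids of snapshots that are pinned by a savepoint
    private final TreeSet<Long> savepoints = new TreeSet<>();

    void commit(long snapshotId, List<String> dataFiles) {
        snapshots.put(snapshotId, new ArrayList<>(dataFiles));
    }

    /** A savepoint cannot be taken on a commit that is already cleaned up. */
    void createSavepoint(long snapshotId) {
        if (!snapshots.containsKey(snapshotId)) {
            throw new IllegalArgumentException(
                    "Snapshot " + snapshotId + " does not exist or was cleaned up");
        }
        savepoints.add(snapshotId);
    }

    /** Expires snapshots with id < minRetainedId, except savepointed ones. */
    void expireBefore(long minRetainedId) {
        snapshots.headMap(minRetainedId).keySet()
                .removeIf(id -> !savepoints.contains(id));
    }

    /** Irreversibly drops every commit (and savepoint) after savepointId. */
    void restore(long savepointId) {
        if (!savepoints.contains(savepointId)) {
            throw new IllegalArgumentException("No savepoint at " + savepointId);
        }
        snapshots.tailMap(savepointId, false).clear();
        savepoints.tailSet(savepointId, false).clear();
    }

    List<Long> snapshotIds() {
        return new ArrayList<>(snapshots.keySet());
    }

    public static void main(String[] args) {
        TableTimeline table = new TableTimeline();
        for (long id = 1; id <= 4; id++) {
            table.commit(id, List.of("data-" + id));
        }
        table.createSavepoint(2);
        table.expireBefore(4);                   // 1 and 3 expire, 2 is pinned
        System.out.println(table.snapshotIds()); // prints [2, 4]
        table.commit(5, List.of("data-5"));
        table.restore(2);                        // commits 4 and 5 are dropped
        System.out.println(table.snapshotIds()); // prints [2]
    }
}
```

The point of the sketch is that a savepoint only pins existing metadata and
files instead of copying them, and that restore is destructive for everything
committed after the savepoint.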

BTW, it would be better to introduce a snapshot view based on savepoint, which
could improve the query performance of historical data for Paimon tables.

-- About Public API of savepoint

The savepoint interfaces currently introduced in the Public API are not enough
for users; for example, deleteSavepoint, restoreSavepoint, etc. are missing.

-- About "Paimon's savepoint need to be combined with Flink's savepoint":

If Paimon supports a savepoint mechanism and provides savepoint interfaces,
this proposal does not block the integration with Flink's savepoint.

In summary, savepoint is used not only to improve the query performance of
historical data, but also for disaster recovery.

On 2023/05/17 09:53:11 Jingsong Li wrote:
> What Shammon mentioned is interesting. I agree with what he said about
> the differences in savepoints between databases and stream computing.
> 
> About "Paimon's savepoint need to be combined with Flink's savepoint":
> 
> I think it is possible, but we may need to deal with this in another
> mechanism, because the snapshots after savepoint may expire. We need
> to compare data between two savepoints to generate incremental data
> for streaming read.
> 
> But this may not need to block the PIP; it looks like the current design
> does not break the future combination?
> 
> Best,
> Jingsong
> 
> On Wed, May 17, 2023 at 5:33 PM Shammon FY <[email protected]> wrote:
> >
> > Hi Caizhi,
> >
> > Thanks for your comments. As you mentioned, I think we may need to discuss
> > the role of savepoint in Paimon.
> >
> > If I understand correctly, the main feature of savepoint in the current PIP
> > is that the savepoint will not be expired, and users can perform a query on
> > the savepoint according to time-travel. Besides that, there is savepoint in
> > the database and Flink.
> >
> > 1. Savepoint in database. The database can roll back table data to the
> > specified 'version' based on savepoint. So the key point of savepoint in
> > the database is to rollback data.
> >
> > 2. Savepoint in Flink. Users can trigger a savepoint with a specific
> > 'path', and save all data of state to the savepoint for job. Then users can
> > create a new job based on the savepoint to continue consuming incremental
> > data. I think the core capabilities are: backup for a job, and resume a job
> > based on the savepoint.
> >
> > In addition to the above, Paimon may also face data write corruption and
> > need to recover data based on the specified savepoint. So we may need to
> > consider what abilities Paimon's savepoint needs besides the ones
> > mentioned in the current PIP.
> >
> > Additionally, as mentioned above, Flink also has
> > savepoint mechanism. During the process of streaming data from Flink to
> > Paimon, does Paimon's savepoint need to be combined with Flink's savepoint?
> >
> >
> > Best,
> > Shammon FY
> >
> >
> > On Wed, May 17, 2023 at 4:02 PM Caizhi Weng <[email protected]> wrote:
> >
> > > Hi developers!
> > >
> > > Thanks Zelin for bringing up the discussion. The proposal seems good to me
> > > overall. However I'd also like to bring up a few options.
> > >
> > > 1. As Jingsong mentioned, Savepoint class should not become a public API,
> > > at least for now. What we need to discuss for the public API is how the
> > > users can create or delete savepoints. For example, what the table option
> > > looks like, what commands and options are provided for the Flink action,
> > > etc.
> > >
> > > 2. Currently most Flink actions are related to streaming processing, so
> > > only Flink can support them. However, savepoint creation and deletion
> > > seem like a feature for batch processing. So aside from Flink actions,
> > > shall we also provide something like Spark actions for savepoints?
> > >
> > > I would also like to comment on Shammon's views.
> > >
> > > > Should we introduce an option for savepoint path which may be different
> > > > from 'warehouse'? Then users can backup the data of savepoint.
> > > >
> > >
> > > I don't think this is necessary. To back up a table, the user just needs
> > > to copy all files from the table directory. Savepoint in Paimon, as far
> > > as I understand, is mainly for users to review historical data, not for
> > > backing up tables.
> > >
> > > > Will the savepoint copy data files from snapshot or only save meta files?
> > > >
> > >
> > > It would be a heavy burden if a savepoint copies all its files. As I
> > > mentioned above, savepoint is not for backing up tables.
> > >
> > > > How can users create a new table and restore data from the specified
> > > > savepoint?
> > >
> > >
> > > This reminds me of savepoints in Flink. Still, savepoint is not for
> > > backing up tables, so I guess we don't need to support "restoring data"
> > > from a savepoint.
> > >
> > > On Wed, May 17, 2023 at 10:32 AM Shammon FY <[email protected]> wrote:
> > >
> > > > Thanks Zelin for initiating this discussion. I have some comments:
> > > >
> > > > 1. Should we introduce an option for savepoint path which may be
> > > > different from 'warehouse'? Then users can backup the data of savepoint.
> > > >
> > > > 2. Will the savepoint copy data files from snapshot or only save meta
> > > > files? The description in the PIP "After we introduce savepoint, we
> > > > should also check if the data files are used by savepoints." looks like
> > > > we only save meta files for savepoint.
> > > >
> > > > 3. How can users create a new table and restore data from the specified
> > > > savepoint?
> > > >
> > > > Best,
> > > > Shammon FY
> > > >
> > > >
> > > > On Wed, May 17, 2023 at 10:19 AM Jingsong Li <[email protected]> wrote:
> > > >
> > > > > Thanks Zelin for driving.
> > > > >
> > > > > Some comments:
> > > > >
> > > > > 1. I think it's possible to advance `Proposed Changes` to the top,
> > > > > Public API has no meaning if I don't know how to do it.
> > > > >
> > > > > 2. Public API, Savepoint and SavepointManager are not Public API, only
> > > > > Flink action or configuration option should be public API.
> > > > >
> > > > > 3. Maybe we can have a separate chapter to describe
> > > > > `savepoint.create-interval`, maybe 'Periodically savepoint'? It is not
> > > > > just an interval, because the real use case is savepoint after 0:00.
> > > > >
> > > > > 4. About 'Interaction with Snapshot', to be continued ...
> > > > >
> > > > > Best,
> > > > > Jingsong
> > > > >
> > > > > On Tue, May 16, 2023 at 7:07 PM yu zelin <[email protected]> wrote:
> > > > > >
> > > > > > Hi, Paimon Devs,
> > > > > >      I’d like to start a discussion about PIP-4[1]. In this PIP, I
> > > > > > want to talk about why we need savepoint, and some thoughts about
> > > > > > managing and using savepoint. Looking forward to your questions and
> > > > > > suggestions.
> > > > > >
> > > > > > Best,
> > > > > > Yu Zelin
> > > > > >
> > > > > > [1] https://cwiki.apache.org/confluence/x/NxE0Dw
> > > > >
> > > >
> > >
>
