Hey Dmitry,

Actually, there are so many possibilities around snapshotting that we're really talking about what I feel are two distinct functionalities ;-)
While persistent snapshotting is indeed useful, what you describe is a mechanism somewhere on the spectrum between archiving and backups, right? I think this may be a nice to have, but not a priority. The reason is that Ignite would typically be part of a Lambda Architecture where recent/actionable data lives in cache + storage, and historical data (the entire dataset) only in storage. So the data ingestion layer (e.g. glued together by Kafka) would take care of feeding the data both into a persistent store (e.g. Cassandra) indexed by time and into Ignite. I believe most users already have some degree of persistence backing Ignite, in order to allow them to recover from a total Ignite disaster, right?

What I had in mind is a functionality that Ignite currently lacks (unless I'm mistaken): the possibility of executing multiple read-only actions against a consistent view of (paused) cache data. If I understand correctly, there's currently no way to tell Ignite: "hey! I want to launch 3 compute jobs, one after another, each taking 5 minutes, against an *identical* set of data, i.e. against a snapshot of the data; I don't want these jobs to see any data changes even if they occur in the underlying cache during this time."

This type of snapshot would be short-lived, so persisting the entire snapshot is questionable. But retaining entries throughout the snapshot's lifespan can also be dangerous due to memory constraints. So... how would we solve this dilemma? Ideas:

* Move only evicted / outdated entries that are still active in the scope of a snapshot to a persistent medium. We would need an indexing mechanism that addresses the location of each data item (e.g. memory, or offset N in persistent file X). As data changes in the underlying cache, Ignite would keep filling up a disk file with the previous state of the updated / evicted items as they stood within the snapshot.
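To make the first idea concrete, here's a very rough sketch in Java of the per-snapshot location index I have in mind. All class and method names are hypothetical (nothing here is existing Ignite API), and this ignores concurrency with the spill writer itself:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: a per-snapshot index recording where each
 *  snapshotted entry currently lives (still on heap, or spilled to disk). */
public class SnapshotIndexSketch {

    /** Location of a snapshotted value: in memory, or at an offset in a spill file. */
    record Location(boolean inMemory, String file, long offset) {
        static Location memory() { return new Location(true, null, -1); }
        static Location onDisk(String file, long offset) { return new Location(false, file, offset); }
    }

    private final Map<String, Location> index = new ConcurrentHashMap<>();

    /** Called when the snapshot is created: every entry starts out in memory. */
    void track(String key) {
        index.put(key, Location.memory());
    }

    /** Called when the underlying cache updates/evicts an entry: its previous
     *  state was appended to a spill file, so the index now points to disk. */
    void spill(String key, String file, long offset) {
        index.replace(key, Location.onDisk(file, offset));
    }

    /** Resolves where a snapshot read for this key should go. */
    Location locate(String key) {
        return index.get(key);
    }

    public static void main(String[] args) {
        SnapshotIndexSketch idx = new SnapshotIndexSketch();
        idx.track("k1");
        idx.track("k2");
        // the cache overwrites k2 -> its old value is spilled at offset 0 of file X
        idx.spill("k2", "snapshot-abc.bin", 0);
        System.out.println(idx.locate("k1").inMemory()); // true
        System.out.println(idx.locate("k2").inMemory()); // false
    }
}
```

The point is just that snapshot reads stay cheap for untouched entries (they resolve to the live in-memory value) and only pay disk I/O for entries the cache has since changed.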
* Keep the snapshot only in memory and allow the user to specify a policy on how to handle memory exhaustion while snapshots are active:
  * Cancel and discard the snapshot when memory usage reaches a certain threshold, interrupting any jobs / queries, etc. that were running and returning an exception.
  * Throttle cache operations while the snapshot is active and memory is getting full (based on a threshold).

To me, the TTL is important if we're retaining entries in memory...

Regards,

*Raúl Kripalani*
PMC & Committer @ Apache Ignite, Apache Camel | Integration, Big Data and
Messaging Engineer
http://about.me/raulkripalani | http://www.linkedin.com/in/raulkripalani
http://blog.raulkr.net | twitter: @raulvk

On Thu, Oct 22, 2015 at 1:58 AM, Dmitriy Setrakyan <dsetrak...@apache.org> wrote:

> On Wed, Oct 21, 2015 at 4:48 PM, Konstantin Boudnik <c...@apache.org> wrote:
>
> > I like it quite a bit, as well! A ticket would make the most sense as well,
> > so there will be a single place to collect the design docs (if needed), etc.
> >
> > On Wed, Oct 21, 2015 at 04:45PM, Dmitriy Setrakyan wrote:
> > > I also really like the idea. One potential use case is fraud analysis in
> > > financial institutions. It rarely makes sense to perform such analysis on
> > > a live system; rather, a snapshot of some data needs to be taken and
> > > analyzed offline.
> > >
> > > I think snapshots should be saved to disk, so users could load them for
> > > analysis on a totally different cluster.
> >
> > I think disk persistence should be optional, not mandatory.
>
> I would actually prefer to support disk-only snapshots. I think it will be
> difficult (double the work) to support both in-memory and disk formats.
> Also, storing snapshots in-memory would require extra memory (a lot of it)
> for something that gets saved mainly for historic purposes or offline
> analysis.
> >
> > Cos
>
> > > Raul, if you don’t mind, can you file a ticket and see if anyone in the
> > > community wants to pick it up?
> > >
> > > D.
> > >
> > > On Wed, Oct 21, 2015 at 5:51 AM, Sergi Vladykin <sergi.vlady...@gmail.com> wrote:
> > >
> > > > Raul,
> > > >
> > > > Actually, SQL indexes are already snapshottable. I'm not sure if it
> > > > makes sense to make the whole cache (with full cache API support)
> > > > snapshottable, but I like your idea about running multiple SQL
> > > > statements against the same snapshot.
> > > >
> > > > Also, I don't think it is a good idea to keep snapshots for a long
> > > > time, so I'd prefer to have a typical AutoCloseable API like:
> > > >
> > > > try (Snapshot s = ...) {
> > > >     s.query(...);
> > > >     s.query(...);
> > > >     s.query(...);
> > > > }
> > > >
> > > > Though I'm not sure when we will be able to get down to this.
> > > >
> > > > Sergi
> > > >
> > > > 2015-10-21 12:06 GMT+03:00 Raul Kripalani <ra...@apache.org>:
> > > >
> > > > > Hey guys,
> > > > >
> > > > > LevelDB has a functionality called Snapshots which provides a
> > > > > consistent read-only view of the DB at a given point in time, against
> > > > > which queries can be executed.
> > > > >
> > > > > To my knowledge, this functionality doesn't exist in the world of
> > > > > open source In-Memory Computing. Ignite could be an innovator here.
> > > > >
> > > > > Ignite Snapshots would allow queries, distributed closures, map-reduce
> > > > > jobs, etc. It could be useful for Spark RDDs to avoid data shift while
> > > > > the computation is taking place (not sure if there's already some form
> > > > > of snapshotting, though). Same for IGFS.
> > > > >
> > > > > Example usage:
> > > > >
> > > > > IgniteCacheSnapshot snapshot =
> > > > >     ignite.cache("mycache").snapshots().create();
> > > > >
> > > > > // all three queries are executed against a view of the cache at the
> > > > > // point in time when it was snapshotted
> > > > > snapshot.query("select ...");
> > > > > snapshot.query("select ...");
> > > > > snapshot.query("select ...");
> > > > >
> > > > > In fact, it would be awesome to be able to logically save this
> > > > > snapshot under a name so that later jobs, queries, etc. can run on top
> > > > > of it, e.g.:
> > > > >
> > > > > IgniteCacheSnapshot snapshot =
> > > > >     ignite.cache("mycache").snapshots().create("abc");
> > > > >
> > > > > // ...
> > > > > // in another module of a distributed system, or in another thread in
> > > > > // parallel, use the saved snapshot
> > > > > IgniteCacheSnapshot snapshot =
> > > > >     ignite.cache("mycache").snapshots().get("abc");
> > > > > ....
> > > > >
> > > > > Named snapshotting can be dangerous due to data retention, e.g.
> > > > > imagine keeping a snapshot for 2 weeks! So we should force the user to
> > > > > specify a TTL:
> > > > >
> > > > > IgniteCacheSnapshot snapshot =
> > > > >     ignite.cache("mycache").snapshots().create("abc", 2, TimeUnit.HOURS);
> > > > >
> > > > > Such functionality would allow for "reporting checkpoints" and "time
> > > > > travel", for example, where you want users to be able to query the
> > > > > data as it stood 1 hour ago, 2 hours ago, etc.
> > > > >
> > > > > What do you think?
> > > > >
> > > > > P.S.: We do have some form of snapshotting in the Compute
> > > > > checkpointing functionality – but my proposal is to generalise the
> > > > > notion.
> > > > >
> > > > > Regards,
> > > > >
> > > > > *Raúl Kripalani*
> > > > > PMC & Committer @ Apache Ignite, Apache Camel | Integration, Big Data
> > > > > and Messaging Engineer
> > > > > http://about.me/raulkripalani | http://www.linkedin.com/in/raulkripalani
> > > > > http://blog.raulkr.net | twitter: @raulvk