Hey Dmitry,

Actually, there are so many possibilities around snapshotting that we're really talking about what I feel are two distinct functionalities ;-)
While persistent snapshotting is indeed useful, what you describe is a mechanism somewhere on the spectrum between archiving and backups, right? I think this may be a nice to have, but not a priority. The reason is that Ignite would typically be part of a Lambda Architecture where recent/actionable data lives in cache + storage, and historical data (the entire dataset) only in storage. So the data ingestion layer (e.g. glued together by Kafka) would take care of feeding the data both into a persistent store (e.g. Cassandra) indexed by time and into Ignite. I believe most users already have some degree of persistence backing Ignite, in order to allow them to recover from a total Ignite disaster, right?

What I had in mind is a functionality that Ignite currently lacks (unless I'm mistaken): the possibility of executing multiple read-only actions against a consistent view of (paused) cache data. If I understand correctly, there's currently no way to tell Ignite: "hey! I want to launch 3 compute jobs, one after another, each taking 5 minutes, against an *identical* set of data, i.e. against a snapshot of the data; I don't want these jobs to see any data changes even if they occur in the underlying cache during this time."

This type of snapshot would be short-lived, so persisting the entire snapshot is questionable. But retaining entries throughout the snapshot's lifespan can also be dangerous due to memory constraints. So... how would we solve this dilemma? Ideas:

* Move only evicted / outdated entries that are still active in the scope of a snapshot to a persistent medium. We would need an indexing mechanism that addresses the location of each data item (e.g. memory, or offset N in persistent file X). As data changes in the underlying cache, Ignite would keep filling up a disk file with the previous state of the updated / evicted items as they stood within the snapshot.
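To make the first idea concrete, here's a very rough sketch in Java of the per-snapshot location index I have in mind. All class and method names are hypothetical (nothing here is existing Ignite API), and this ignores concurrency with the spill writer itself:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: a per-snapshot index recording where each
 *  snapshotted entry currently lives (still on heap, or spilled to disk). */
public class SnapshotIndexSketch {

    /** Location of a snapshotted value: in memory, or at an offset in a spill file. */
    record Location(boolean inMemory, String file, long offset) {
        static Location memory() { return new Location(true, null, -1); }
        static Location onDisk(String file, long offset) { return new Location(false, file, offset); }
    }

    private final Map<String, Location> index = new ConcurrentHashMap<>();

    /** Called when the snapshot is created: every entry starts out in memory. */
    void track(String key) {
        index.put(key, Location.memory());
    }

    /** Called when the underlying cache updates/evicts an entry: its previous
     *  state was appended to a spill file, so the index now points to disk. */
    void spill(String key, String file, long offset) {
        index.replace(key, Location.onDisk(file, offset));
    }

    /** Resolves where a snapshot read for this key should go. */
    Location locate(String key) {
        return index.get(key);
    }

    public static void main(String[] args) {
        SnapshotIndexSketch idx = new SnapshotIndexSketch();
        idx.track("k1");
        idx.track("k2");
        // the cache overwrites k2 -> its old value is spilled at offset 0 of file X
        idx.spill("k2", "snapshot-abc.bin", 0);
        System.out.println(idx.locate("k1").inMemory()); // true
        System.out.println(idx.locate("k2").inMemory()); // false
    }
}
```

The point is just that snapshot reads stay cheap for untouched entries (they resolve to the live in-memory value) and only pay disk I/O for entries the cache has since changed.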
* Keep the snapshot only in memory and allow the user to specify a policy on how to handle memory exhaustion while snapshots are active:
  * Cancel and discard the snapshot when memory usage reaches a certain threshold, interrupting any jobs / queries, etc. that were running and returning an exception.
  * Throttle cache operations while the snapshot is active and memory is getting full (based on a threshold).

To me, the TTL is important if we're retaining entries in memory...

Regards,

*Raúl Kripalani*
PMC & Committer @ Apache Ignite, Apache Camel | Integration, Big Data and
Messaging Engineer
http://about.me/raulkripalani | http://www.linkedin.com/in/raulkripalani
http://blog.raulkr.net | twitter: @raulvk

On Thu, Oct 22, 2015 at 1:58 AM, Dmitriy Setrakyan <dsetrak...@apache.org> wrote:

> On Wed, Oct 21, 2015 at 4:48 PM, Konstantin Boudnik <c...@apache.org> wrote:
>
> > I like it quite a bit, as well! A ticket would make the most sense as well,
> > so there will be a single place to collect the design docs (if needed), etc.
> >
> > On Wed, Oct 21, 2015 at 04:45PM, Dmitriy Setrakyan wrote:
> > > I also really like the idea. One potential use case is fraud analysis in
> > > financial institutions. It rarely makes sense to perform such analysis on
> > > a live system; rather, a snapshot of some data needs to be taken and
> > > analyzed offline.
> > >
> > > I think snapshots should be saved to disk, so users could load them for
> > > analysis on a totally different cluster.
> >
> > I think disk persistence should be optional, not mandatory.
>
> I would actually prefer to support disk-only snapshots. I think it will be
> difficult (double the work) to support both in-memory and disk formats.
> Also, storing snapshots in-memory would require extra memory (a lot of it)
> for something that gets saved mainly for historic purposes or offline
> analysis.
> >
> > Cos
>
> > > Raul, if you don’t mind, can you file a ticket and see if anyone in the
> > > community wants to pick it up?
> > >
> > > D.
> > >
> > > On Wed, Oct 21, 2015 at 5:51 AM, Sergi Vladykin <sergi.vlady...@gmail.com> wrote:
> > >
> > > > Raul,
> > > >
> > > > Actually, SQL indexes are already snapshottable. I'm not sure if it
> > > > makes sense to make the whole cache (with full cache API support)
> > > > snapshottable, but I like your idea about running multiple SQL
> > > > statements against the same snapshot.
> > > >
> > > > Also, I don't think it is a good idea to keep snapshots for a long
> > > > time, so I'd prefer to have a typical AutoCloseable API like:
> > > >
> > > > try (Snapshot s = ...) {
> > > >     s.query(...);
> > > >     s.query(...);
> > > >     s.query(...);
> > > > }
> > > >
> > > > Though I'm not sure when we will be able to get down to this.
> > > >
> > > > Sergi
> > > >
> > > > 2015-10-21 12:06 GMT+03:00 Raul Kripalani <ra...@apache.org>:
> > > >
> > > > > Hey guys,
> > > > >
> > > > > LevelDB has a functionality called Snapshots which provides a
> > > > > consistent read-only view of the DB at a given point in time, against
> > > > > which queries can be executed.
> > > > >
> > > > > To my knowledge, this functionality doesn't exist in the world of
> > > > > open source In-Memory Computing. Ignite could be an innovator here.
> > > > >
> > > > > Ignite Snapshots would allow queries, distributed closures, map-reduce
> > > > > jobs, etc. It could be useful for Spark RDDs to avoid data shift while
> > > > > the computation is taking place (not sure if there's already some form
> > > > > of snapshotting, though). Same for IGFS.
> > > > >
> > > > > Example usage:
> > > > >
> > > > > IgniteCacheSnapshot snapshot =
> > > > >     ignite.cache("mycache").snapshots().create();
> > > > >
> > > > > // all three queries are executed against a view of the cache at the
> > > > > // point in time when it was snapshotted
> > > > > snapshot.query("select ...");
> > > > > snapshot.query("select ...");
> > > > > snapshot.query("select ...");
> > > > >
> > > > > In fact, it would be awesome to be able to logically save this
> > > > > snapshot under a name so that later jobs, queries, etc. can run on top
> > > > > of it, e.g.:
> > > > >
> > > > > IgniteCacheSnapshot snapshot =
> > > > >     ignite.cache("mycache").snapshots().create("abc");
> > > > >
> > > > > // ...
> > > > > // in another module of a distributed system, or in another thread in
> > > > > // parallel, use the saved snapshot
> > > > > IgniteCacheSnapshot snapshot =
> > > > >     ignite.cache("mycache").snapshots().get("abc");
> > > > > ....
> > > > >
> > > > > Named snapshotting can be dangerous due to data retention, e.g.
> > > > > imagine keeping a snapshot for 2 weeks! So we should force the user to
> > > > > specify a TTL:
> > > > >
> > > > > IgniteCacheSnapshot snapshot =
> > > > >     ignite.cache("mycache").snapshots().create("abc", 2, TimeUnit.HOURS);
> > > > >
> > > > > Such functionality would allow for "reporting checkpoints" and "time
> > > > > travel", for example, where you want users to be able to query the
> > > > > data as it stood 1 hour ago, 2 hours ago, etc.
> > > > >
> > > > > What do you think?
> > > > >
> > > > > P.S.: We do have some form of snapshotting in the Compute
> > > > > checkpointing functionality – but my proposal is to generalise the
> > > > > notion.
> > > > >
> > > > > Regards,
> > > > >
> > > > > *Raúl Kripalani*
> > > > > PMC & Committer @ Apache Ignite, Apache Camel | Integration, Big Data
> > > > > and Messaging Engineer
> > > > > http://about.me/raulkripalani | http://www.linkedin.com/in/raulkripalani
> > > > > http://blog.raulkr.net | twitter: @raulvk