Re: IgniteCache.loadCache improvement proposal

Valentin Kulichenko Tue, 15 Nov 2016 12:13:02 -0800

It sounds like Aleksandr is basically proposing to support automatic
persistence [1] for loading through the data streamer, which we really don't
have today. However, I think I have a more generic solution in mind.


What if we add one more IgniteCache.loadCache overload like this:

loadCache(@Nullable IgniteBiPredicate<K, V> p, IgniteBiInClosure<K, V> clo,
    @Nullable Object... args)

It's the same as the existing one, but with the key-value closure provided
as a parameter. This closure will be passed to CacheStore.loadCache along
with the arguments and will allow the user to override the logic that
actually saves the loaded entry in the cache (currently this logic is always
provided by the cache itself and the user can't control it).

We can then provide an implementation of this closure that will create a
data streamer and call addData() within its apply() method.
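
To make the idea concrete, here is a minimal sketch of such a closure. It does
not use the real Ignite classes: BiInClosure, Streamer and Store are hypothetical
stand-ins for IgniteBiInClosure, IgniteDataStreamer and CacheStore, reduced to
the single method each one needs here, and the static loadCache helper only
mimics the shape of the proposed overload (the predicate argument is omitted
for brevity). Treat it as an illustration of the wiring, not as the actual
implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class StreamerLoadSketch {
    /** Stand-in for IgniteBiInClosure<K, V>: a callback that takes a key and a value. */
    interface BiInClosure<K, V> {
        void apply(K key, V val);
    }

    /** Stand-in for the part of IgniteDataStreamer<K, V> used in this sketch. */
    interface Streamer<K, V> extends AutoCloseable {
        void addData(K key, V val);

        @Override void close();
    }

    /** Stand-in for CacheStore: loadCache reports every loaded entry to the closure. */
    interface Store<K, V> {
        void loadCache(BiInClosure<K, V> clo, Object... args);
    }

    /** The proposed closure implementation: each loaded entry is handed to the streamer. */
    static final class StreamerClosure<K, V> implements BiInClosure<K, V> {
        private final Streamer<K, V> streamer;

        StreamerClosure(Streamer<K, V> streamer) {
            this.streamer = streamer;
        }

        @Override public void apply(K key, V val) {
            streamer.addData(key, val);
        }
    }

    /** Hypothetical shape of the new overload: the caller, not the cache, supplies the closure. */
    static <K, V> void loadCache(Store<K, V> store, BiInClosure<K, V> clo, Object... args) {
        store.loadCache(clo, args);
    }

    /** Toy streamer that only records what it receives, so the sketch runs without a cluster. */
    static final class RecordingStreamer implements Streamer<Integer, String> {
        final List<String> received = new ArrayList<>();
        boolean closed;

        @Override public void addData(Integer key, String val) {
            received.add(key + "=" + val);
        }

        @Override public void close() {
            closed = true;
        }
    }

    public static void main(String[] args) {
        // Any existing store can be reused: it only sees the closure it is given.
        Store<Integer, String> store = (clo, a) -> {
            for (int i = 0; i < 3; i++)
                clo.apply(i, "person-" + i);
        };

        try (RecordingStreamer streamer = new RecordingStreamer()) {
            loadCache(store, new StreamerClosure<>(streamer));
            System.out.println(streamer.received);
        }
    }
}
```

Switching back to the current behaviour would then just mean passing a
different closure to the same store, which is the point of the second
advantage listed below.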

I see the following advantages:

   - Any existing CacheStore implementation can be reused to load through
   the streamer (our JDBC and Cassandra stores or anything else the user has).
   - Loading code is always part of the CacheStore implementation, so it's
   very easy to switch between different ways of loading.
   - Users are not limited to the two approaches we provide out of the box,
   they can always implement a new one.

Thoughts?

[1] https://apacheignite.readme.io/docs/automatic-persistence

-Val

On Tue, Nov 15, 2016 at 2:27 AM, Alexey Kuznetsov <[email protected]>
wrote:

> Hi, All!
>
> I think we do not need to change the API at all.
>
> public void loadCache(@Nullable IgniteBiPredicate<K, V> p, @Nullable
> Object... args) throws CacheException;
>
> We could pass any args to loadCache();
>
> So we could create class
>  IgniteCacheLoadDescriptor {
>  some fields that will describe how to load
> }
>
>
> and modify POJO store to detect and use such arguments.
>
>
> All we need is to implement this and write good documentation and examples.
>
> Thoughts?
>
> On Tue, Nov 15, 2016 at 5:22 PM, Alexandr Kuramshin <[email protected]>
> wrote:
>
> > Hi Vladimir,
> >
> > I don't offer any changes in API. Usage scenario is the same as it was
> > described in
> > https://apacheignite.readme.io/docs/persistent-store#section-loadcache-
> >
> > The preload cache logic invokes IgniteCache.loadCache() with some
> > additional arguments, depending on a CacheStore implementation, and then
> > the loading occurs in the way I've already described.
> >
> >
> > 2016-11-15 11:26 GMT+03:00 Vladimir Ozerov <[email protected]>:
> >
> > > Hi Alex,
> > >
> > > >>> Let's give the user the reusable code which is convenient, reliable
> > > >>> and fast.
> > > > Convenience - this is why I asked for an example of what the API could
> > > > look like and how users are going to use it.
> > >
> > > Vladimir.
> > >
> > > On Tue, Nov 15, 2016 at 11:18 AM, Alexandr Kuramshin <
> > [email protected]
> > > >
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > > I think the discussion is going in the wrong direction. Certainly it's
> > > > > not a big deal to implement some custom user logic to load the data
> > > > > into caches. But the Ignite framework gives the user some reusable code
> > > > > built on top of the basic system.
> > > > >
> > > > > So the main question is: why do the developers let the user load caches
> > > > > in a convenient way that is backed by a totally non-optimal solution?
> > > >
> > > > > We could talk at length about different persistence storage types, but
> > > > > whenever we initiate the loading with IgniteCache.loadCache, the
> > > > > current implementation imposes a lot of overhead on the network.
> > > >
> > > > > Partition-aware data loading may be used in some scenarios to avoid
> > > > > this network overhead, but the users are compelled to take additional
> > > > > steps to achieve this optimization: adding a column to the tables,
> > > > > adding compound indices that include the added column, writing a piece
> > > > > of repeatable code to load the data into different caches in a
> > > > > fault-tolerant fashion, etc.
> > > >
> > > > > Let's give the user the reusable code which is convenient, reliable
> > > > > and fast.
> > > >
> > > > 2016-11-14 20:56 GMT+03:00 Valentin Kulichenko <
> > > > [email protected]>:
> > > >
> > > > > Hi Aleksandr,
> > > > >
> > > > > > Data streamer is already outlined as one of the possible approaches
> > > > > > for loading the data [1]. Basically, you start a designated client
> > > > > > node or choose a leader among server nodes [2] and then use the
> > > > > > IgniteDataStreamer API to load the data. With this approach there is
> > > > > > no need to have the CacheStore implementation at all. Can you please
> > > > > > elaborate on what additional value you are trying to add here?
> > > > >
> > > > > > [1] https://apacheignite.readme.io/docs/data-loading#ignitedatastreamer
> > > > > [2] https://apacheignite.readme.io/docs/leader-election
> > > > >
> > > > > -Val
> > > > >
> > > > > On Mon, Nov 14, 2016 at 8:23 AM, Dmitriy Setrakyan <
> > > > [email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I just want to clarify a couple of API details from the original
> > > email
> > > > to
> > > > > > make sure that we are making the right assumptions here.
> > > > > >
> > > > > > *"Because of none keys are passed to the CacheStore.loadCache
> > > methods,
> > > > > the
> > > > > > > underlying implementation is forced to read all the data from
> the
> > > > > > > persistence storage"*
> > > > > >
> > > > > >
> > > > > > > According to the javadoc, loadCache(...) method receives an optional
> > > > > > > argument from the user. You can pass anything you like, including a
> > > > > > > list of keys, or an SQL where clause, etc.
> > > > > >
> > > > > > *"The partition-aware data loading approach is not a choice. It
> > > > requires
> > > > > > > persistence of the volatile data depended on affinity function
> > > > > > > implementation and settings."*
> > > > > >
> > > > > >
> > > > > > > This is only partially true. While Ignite allows plugging in custom
> > > > > > > affinity functions, the affinity function is not something that
> > > > > > > changes dynamically and should always return the same partition for
> > > > > > > the same key. So, the partition assignments are not volatile at all.
> > > > > > > If, in some very rare case, the partition assignment logic needs to
> > > > > > > change, then you could update the partition assignments that you may
> > > > > > > have persisted elsewhere as well, e.g. in a database.
> > > > > >
> > > > > > D.
> > > > > >
> > > > > > On Mon, Nov 14, 2016 at 10:23 AM, Vladimir Ozerov <
> > > > [email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Alexandr, Alexey,
> > > > > > >
> > > > > > > While I agree with you that current cache loading logic is far
> > from
> > > > > > ideal,
> > > > > > > it would be cool to see API drafts based on your suggestions to
> > get
> > > > > > better
> > > > > > > understanding of your ideas. How exactly users are going to use
> > > your
> > > > > > > suggestions?
> > > > > > >
> > > > > > > My main concern is that initial load is not very trivial task
> in
> > > > > general
> > > > > > > case. Some users have centralized RDBMS systems, some have
> NoSQL,
> > > > > others
> > > > > > > work with distributed persistent stores (e.g. HDFS). Sometimes
> we
> > > > have
> > > > > > > Ignite nodes "near" persistent data, sometimes we don't.
> > Sharding,
> > > > > > > affinity, co-location, etc.. If we try to support all (or many)
> > > cases
> > > > > out
> > > > > > > of the box, we may end up in very messy and difficult API. So
> we
> > > > should
> > > > > > > carefully balance between simplicity, usability and
> feature-rich
> > > > > > > characteristics here.
> > > > > > >
> > > > > > > Personally, I think that if user is not satisfied with
> > > "loadCache()"
> > > > > API,
> > > > > > > he just writes simple closure with blackjack streamer and
> queries
> > > and
> > > > > > send
> > > > > > > it to whatever node he finds convenient. Not a big deal. Only
> > very
> > > > > common
> > > > > > > cases should be added to Ignite API.
> > > > > > >
> > > > > > > Vladimir.
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Nov 14, 2016 at 12:43 PM, Alexey Kuznetsov <
> > > > > > > [email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > > Looks good to me.
> > > > > > > > >
> > > > > > > > > But I would suggest considering one more use case:
> > > > > > > > >
> > > > > > > > > If the user knows their data, they could split the loading manually.
> > > > > > > > > For example: the Person table contains 10M rows.
> > > > > > > > > The user could provide something like:
> > > > > > > > > cache.loadCache(null,
> > > > > > > > >     "Person", "select * from Person where id < 1_000_000",
> > > > > > > > >     "Person", "select * from Person where id >= 1_000_000 and id < 2_000_000",
> > > > > > > > >     ....
> > > > > > > > >     "Person", "select * from Person where id >= 9_000_000 and id < 10_000_000"
> > > > > > > > > );
> > > > > > > >
> > > > > > > > > or maybe it could be some descriptor object like
> > > > > > > > >
> > > > > > > > >  {
> > > > > > > > >    sql: "select * from Person where id >= ? and id < ?"
> > > > > > > > >    range: 0...10_000_000
> > > > > > > > > }
> > > > > > > >
> > > > > > > > > In this case the provided queries will be sent to as many nodes as
> > > > > > > > > there are queries. The data will be loaded in parallel, and for keys
> > > > > > > > > that are not local the data streamer should be used (as described
> > > > > > > > > in Alexandr's description).
> > > > > > > >
> > > > > > > > I think it is a good issue for Ignite 2.0
> > > > > > > >
> > > > > > > > Vova, Val - what do you think?
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Nov 14, 2016 at 4:01 PM, Alexandr Kuramshin <
> > > > > > > [email protected]>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > >> All right,
> > > > > > > >>
> > > > > > > > >> Let's assume a simple scenario. When the IgniteCache.loadCache is
> > > > > > > > >> invoked, we check whether the cache is not local, and if so, then
> > > > > > > > >> we'll initiate the new loading logic.
> > > > > > > > >>
> > > > > > > > >> First, we take a "streamer" node, it could be done by utilizing
> > > > > > > > >> LoadBalancingSpi, or it may be configured statically, for the
> > > > > > > > >> reason that the streamer node is running on the same host as the
> > > > > > > > >> persistence storage provider.
> > > > > > > > >>
> > > > > > > > >> After that we start the loading task on the streamer node which
> > > > > > > > >> creates IgniteDataStreamer and loads the cache with
> > > > > > > > >> CacheStore.loadCache. Every call to IgniteBiInClosure.apply simply
> > > > > > > > >> invokes IgniteDataStreamer.addData.
> > > > > > > >>
> > > > > > > > >> This implementation will completely relieve overhead on the
> > > > > > > > >> persistence storage provider. Network overhead is also decreased in
> > > > > > > > >> the case of partitioned caches. For two nodes we get 1 1/2 times the
> > > > > > > > >> amount of data transferred over the network (1 part will be
> > > > > > > > >> transferred from the persistence storage to the streamer, and then
> > > > > > > > >> 1/2 from the streamer node to the other node). For three nodes it
> > > > > > > > >> will be 1 2/3 and so on, up to two times the amount of data on big
> > > > > > > > >> clusters.
> > > > > > > >>
> > > > > > > > >> I'd like to propose an additional optimization here. If we have the
> > > > > > > > >> streamer node on the same machine as the persistence storage
> > > > > > > > >> provider, then we completely relieve the network overhead as well.
> > > > > > > > >> It could be a special daemon node for the cache loading assigned in
> > > > > > > > >> the cache configuration, or an ordinary server node.
> > > > > > > >>
> > > > > > > > >> Certainly these calculations assume an evenly partitioned cache with
> > > > > > > > >> only primary nodes (without backups). In the case of one backup (the
> > > > > > > > >> most frequent case, I think), we get 2 times the amount of data
> > > > > > > > >> transferred over the network on two nodes, 2 1/3 on three, 2 1/2 on
> > > > > > > > >> four, and so on up to three times the amount of data on big
> > > > > > > > >> clusters. Hence it's still better than the current implementation.
> > > > > > > > >> In the worst case, with a fully replicated cache, we get N+1 times
> > > > > > > > >> the amount of data transferred over the network (where N is the
> > > > > > > > >> number of nodes in the cluster). But it's not a problem in small
> > > > > > > > >> clusters, and a little overhead in big clusters. And we still gain
> > > > > > > > >> the persistence storage provider optimization.
> > > > > > > >>
> > > > > > > > >> Now let's take a more complex scenario. To achieve some level of
> > > > > > > > >> parallelism, we could split our cluster into several groups. It
> > > > > > > > >> could be a parameter of the IgniteCache.loadCache method or a cache
> > > > > > > > >> configuration option. The number of groups could be a fixed value,
> > > > > > > > >> or it could be calculated dynamically from the maximum number of
> > > > > > > > >> nodes in a group.
> > > > > > > >>
> > > > > > > > >> After splitting the whole cluster into groups, we take a streamer
> > > > > > > > >> node in each group and submit a task for loading the cache, similar
> > > > > > > > >> to the single streamer scenario, except that only the keys that
> > > > > > > > >> correspond to the cluster group where the streamer node is running
> > > > > > > > >> will be passed to the IgniteDataStreamer.addData method.
> > > > > > > >>
> > > > > > > > >> In this case the overhead grows with the level of parallelism, not
> > > > > > > > >> with the number of nodes in the whole cluster.
> > > > > > > >>
> > > > > > > >> 2016-11-11 15:37 GMT+03:00 Alexey Kuznetsov <
> > > > [email protected]
> > > > > >:
> > > > > > > >>
> > > > > > > >> > Alexandr,
> > > > > > > >> >
> > > > > > > >> > Could you describe your proposal in more details?
> > > > > > > >> > Especially in case with several nodes.
> > > > > > > >> >
> > > > > > > >> > On Fri, Nov 11, 2016 at 6:34 PM, Alexandr Kuramshin <
> > > > > > > >> [email protected]>
> > > > > > > >> > wrote:
> > > > > > > >> >
> > > > > > > >> > > Hi,
> > > > > > > >> > >
> > > > > > > > >> > > You know CacheStore API that is commonly used for
> > > > > > > > >> > > read/write-through relationship of the in-memory data with
> > > > > > > > >> > > the persistence storage.
> > > > > > > > >> > >
> > > > > > > > >> > > There is also IgniteCache.loadCache method for hot-loading
> > > > > > > > >> > > the cache on startup. Invocation of this method causes
> > > > > > > > >> > > execution of CacheStore.loadCache on the all nodes storing
> > > > > > > > >> > > the cache partitions. Because of none keys are passed to the
> > > > > > > > >> > > CacheStore.loadCache methods, the underlying implementation
> > > > > > > > >> > > is forced to read all the data from the persistence storage,
> > > > > > > > >> > > but only part of the data will be stored on each node.
> > > > > > > >> > >
> > > > > > > > >> > > So, the current implementation has two general drawbacks:
> > > > > > > > >> > >
> > > > > > > > >> > > 1. The persistence storage is forced to perform as many
> > > > > > > > >> > > identical queries as there are nodes in the cluster. Each
> > > > > > > > >> > > query may involve much additional computation on the
> > > > > > > > >> > > persistence storage server.
> > > > > > > > >> > >
> > > > > > > > >> > > 2. The network is forced to transfer much more data, which is
> > > > > > > > >> > > obviously a big disadvantage on large systems.
> > > > > > > >> > >
> > > > > > > > >> > > The partition-aware data loading approach, described in
> > > > > > > > >> > > https://apacheignite.readme.io/docs/data-loading#section-partition-aware-data-loading
> > > > > > > > >> > > , is not a choice. It requires persistence of the volatile
> > > > > > > > >> > > data depended on affinity function implementation and
> > > > > > > > >> > > settings.
> > > > > > > >> > >
> > > > > > > >> > > I propose using something like IgniteDataStreamer inside
> > > > > > > >> > > IgniteCache.loadCache implementation.
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > --
> > > > > > > >> > > Thanks,
> > > > > > > >> > > Alexandr Kuramshin
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> > --
> > > > > > > >> > Alexey Kuznetsov
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> --
> > > > > > > >> Thanks,
> > > > > > > >> Alexandr Kuramshin
> > > > > > > >>
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Alexey Kuznetsov
> > > > > > > > GridGain Systems
> > > > > > > > www.gridgain.com
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks,
> > > > Alexandr Kuramshin
> > > >
> > >
> >
> >
> >
> > --
> > Thanks,
> > Alexandr Kuramshin
> >
>
>
>
> --
> Alexey Kuznetsov
>
