How would your proposal resolve the main point Aleksandr is trying to convey, namely the extensive network utilization?
As I see it, the loadCache method will still be triggered on every node and, as before, all the nodes will pre-load the whole data set from the database. That was Aleksandr's reasonable concern. If we come up with a way to call loadCache on a specific node only and implement some fault-tolerant mechanism, then your suggestion should work perfectly fine.

Denis

> On Nov 15, 2016, at 12:05 PM, Valentin Kulichenko <valentin.kuliche...@gmail.com> wrote:
>
> It sounds like Aleksandr is basically proposing to support automatic
> persistence [1] for loading through the data streamer, and we really don't
> have this. However, I think I have a more generic solution in mind.
>
> What if we add one more IgniteCache.loadCache overload like this:
>
>     loadCache(@Nullable IgniteBiPredicate<K, V> p,
>         IgniteBiInClosure<K, V> clo,
>         @Nullable Object... args)
>
> It's the same as the existing one, but with a key-value closure provided
> as a parameter. This closure will be passed to CacheStore.loadCache along
> with the arguments and will allow overriding the logic that actually saves
> the loaded entry in the cache (currently this logic is always provided by
> the cache itself and the user can't control it).
>
> We can then provide an implementation of this closure that will create a
> data streamer and call addData() within its apply() method.
>
> I see the following advantages:
>
> - Any existing CacheStore implementation can be reused to load through
>   the streamer (our JDBC and Cassandra stores, or anything else the user has).
> - The loading code is always part of the CacheStore implementation, so
>   it's very easy to switch between different ways of loading.
> - The user is not limited to the two approaches we provide out of the box;
>   they can always implement a new one.
>
> Thoughts?
>
> [1] https://apacheignite.readme.io/docs/automatic-persistence
>
> -Val
>
> On Tue, Nov 15, 2016 at 2:27 AM, Alexey Kuznetsov <akuznet...@apache.org> wrote:
>
>> Hi, All!
>>
>> I think we do not need to change the API at all:
>>
>>     public void loadCache(@Nullable IgniteBiPredicate<K, V> p,
>>         @Nullable Object... args) throws CacheException;
>>
>> We can pass any args to loadCache(). So we could create a class
>>
>>     IgniteCacheLoadDescriptor {
>>         // some fields that describe how to load
>>     }
>>
>> and modify the POJO store to detect and use such arguments.
>>
>> All we need is to implement this and write good documentation and examples.
>>
>> Thoughts?
>>
>> On Tue, Nov 15, 2016 at 5:22 PM, Alexandr Kuramshin <ein.nsk...@gmail.com> wrote:
>>
>>> Hi Vladimir,
>>>
>>> I don't propose any changes to the API. The usage scenario is the same
>>> as described in
>>> https://apacheignite.readme.io/docs/persistent-store#section-loadcache-
>>>
>>> The cache preloading logic invokes IgniteCache.loadCache() with some
>>> additional arguments, depending on the CacheStore implementation, and
>>> then the loading occurs in the way I've already described.
>>>
>>> 2016-11-15 11:26 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:
>>>
>>>> Hi Alex,
>>>>
>>>>> Let's give the user the reusable code which is convenient, reliable
>>>>> and fast.
>>>>
>>>> Convenience: this is why I asked for an example of how the API could
>>>> look and how users are going to use it.
>>>>
>>>> Vladimir.
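To make the overload proposed by Val above concrete, here is a minimal sketch of a streamer-backed closure. This is illustrative only: the overload does not exist yet, and StreamerLoadClosure is a hypothetical name.

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteDataStreamer;
    import org.apache.ignite.lang.IgniteBiInClosure;

    /** Hypothetical closure that routes every loaded entry through a data streamer. */
    public class StreamerLoadClosure<K, V> implements IgniteBiInClosure<K, V>, AutoCloseable {
        private final IgniteDataStreamer<K, V> streamer;

        public StreamerLoadClosure(Ignite ignite, String cacheName) {
            // In a real implementation the streamer would have to be created
            // on the node that actually executes CacheStore.loadCache.
            streamer = ignite.dataStreamer(cacheName);
        }

        /** Invoked by CacheStore.loadCache for each entry read from the storage. */
        @Override public void apply(K key, V val) {
            streamer.addData(key, val);
        }

        /** Flushes the remaining buffered entries once loading completes. */
        @Override public void close() {
            streamer.close();
        }
    }

With such a closure, the proposed call would look like cache.loadCache(null, new StreamerLoadClosure<>(ignite, "myCache")), and any existing CacheStore would load through the streamer unchanged.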
>>>> On Tue, Nov 15, 2016 at 11:18 AM, Alexandr Kuramshin <ein.nsk...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I think the discussion is going in the wrong direction. Certainly it's
>>>>> not a big deal to implement some custom user logic to load the data
>>>>> into caches. But the Ignite framework gives the user reusable code
>>>>> built on top of the basic system.
>>>>>
>>>>> So the main question is: why do the developers let the user use a
>>>>> convenient way to load caches that is a totally non-optimal solution?
>>>>>
>>>>> We could talk at length about different persistence storage types, but
>>>>> whenever we initiate the loading with IgniteCache.loadCache, the
>>>>> current implementation imposes a lot of overhead on the network.
>>>>>
>>>>> Partition-aware data loading may be used in some scenarios to avoid
>>>>> this network overhead, but users are compelled to take additional
>>>>> steps to achieve this optimization: adding a column to the tables,
>>>>> adding compound indices that include the added column, writing a piece
>>>>> of repetitive code to load the data into different caches in a
>>>>> fault-tolerant fashion, etc.
>>>>>
>>>>> Let's give the user reusable code which is convenient, reliable and fast.
>>>>>
>>>>> 2016-11-14 20:56 GMT+03:00 Valentin Kulichenko <valentin.kuliche...@gmail.com>:
>>>>>
>>>>>> Hi Aleksandr,
>>>>>>
>>>>>> The data streamer is already outlined as one of the possible
>>>>>> approaches for loading the data [1]. Basically, you start a
>>>>>> designated client node, or choose a leader among the server nodes
>>>>>> [2], and then use the IgniteDataStreamer API to load the data. With
>>>>>> this approach there is no need to have a CacheStore implementation at
>>>>>> all. Can you please elaborate on what additional value you are trying
>>>>>> to add here?
>>>>>>
>>>>>> [1] https://apacheignite.readme.io/docs/data-loading#ignitedatastreamer
>>>>>> [2] https://apacheignite.readme.io/docs/leader-election
>>>>>>
>>>>>> -Val
>>>>>>
>>>>>> On Mon, Nov 14, 2016 at 8:23 AM, Dmitriy Setrakyan <dsetrak...@apache.org> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I just want to clarify a couple of API details from the original
>>>>>>> email to make sure that we are making the right assumptions here.
>>>>>>>
>>>>>>> *"Because no keys are passed to the CacheStore.loadCache methods,
>>>>>>> the underlying implementation is forced to read all the data from
>>>>>>> the persistence storage"*
>>>>>>>
>>>>>>> According to the javadoc, the loadCache(...) method receives
>>>>>>> optional arguments from the user. You can pass anything you like,
>>>>>>> including a list of keys, an SQL where clause, etc.
>>>>>>>
>>>>>>> *"The partition-aware data loading approach is not an option. It
>>>>>>> requires persisting volatile data that depends on the affinity
>>>>>>> function implementation and settings."*
>>>>>>>
>>>>>>> This is only partially true. While Ignite allows plugging in custom
>>>>>>> affinity functions, the affinity function is not something that
>>>>>>> changes dynamically, and it should always return the same partition
>>>>>>> for the same key. So the partition assignments are not volatile at
>>>>>>> all. If, in some very rare case, the partition assignment logic
>>>>>>> needs to change, then you could also update the partition
>>>>>>> assignments that you may have persisted elsewhere, e.g. in the
>>>>>>> database.
>>>>>>>
>>>>>>> D.
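A small sketch of Dmitriy's point about optional arguments: a WHERE clause can travel from the loadCache call site down to the store. The store below is a hand-written illustration, not a shipped implementation, and Person is an assumed user class.

    import javax.cache.Cache;
    import org.apache.ignite.cache.store.CacheStoreAdapter;
    import org.apache.ignite.lang.IgniteBiInClosure;

    /** Illustrative store that treats the first loadCache argument as a WHERE clause. */
    public class WhereClauseStore extends CacheStoreAdapter<Long, Person> {
        @Override public void loadCache(IgniteBiInClosure<Long, Person> clo, Object... args) {
            String where = args != null && args.length > 0 ? (String)args[0] : "1 = 1";

            // Run "select id, name from Person where " + where against the
            // database and hand every row to the closure:
            // clo.apply(id, new Person(id, name));
        }

        // Read/write-through methods are not relevant to this example.
        @Override public Person load(Long key) { return null; }
        @Override public void write(Cache.Entry<? extends Long, ? extends Person> e) { /* no-op */ }
        @Override public void delete(Object key) { /* no-op */ }
    }

    /** Assumed user value class. */
    class Person {
        final long id;
        final String name;
        Person(long id, String name) { this.id = id; this.name = name; }
    }

The call site would then be, for instance, cache.loadCache(null, "id < 1000000").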
>>>>>>> On Mon, Nov 14, 2016 at 10:23 AM, Vladimir Ozerov <voze...@gridgain.com> wrote:
>>>>>>>
>>>>>>>> Alexandr, Alexey,
>>>>>>>>
>>>>>>>> While I agree with you that the current cache loading logic is far
>>>>>>>> from ideal, it would be cool to see API drafts based on your
>>>>>>>> suggestions to get a better understanding of your ideas. How
>>>>>>>> exactly are users going to use them?
>>>>>>>>
>>>>>>>> My main concern is that initial load is not a trivial task in the
>>>>>>>> general case. Some users have centralized RDBMS systems, some have
>>>>>>>> NoSQL, others work with distributed persistent stores (e.g. HDFS).
>>>>>>>> Sometimes we have Ignite nodes "near" the persistent data,
>>>>>>>> sometimes we don't. Sharding, affinity, co-location, etc. If we try
>>>>>>>> to support all (or many) cases out of the box, we may end up with a
>>>>>>>> very messy and difficult API. So we should carefully balance
>>>>>>>> simplicity, usability and richness of features here.
>>>>>>>>
>>>>>>>> Personally, I think that if a user is not satisfied with the
>>>>>>>> "loadCache()" API, he just writes a simple closure with a streamer
>>>>>>>> and queries and sends it to whatever node he finds convenient. Not
>>>>>>>> a big deal. Only very common cases should be added to the Ignite API.
>>>>>>>>
>>>>>>>> Vladimir.
>>>>>>>>
>>>>>>>> On Mon, Nov 14, 2016 at 12:43 PM, Alexey Kuznetsov <akuznet...@gridgain.com> wrote:
>>>>>>>>
>>>>>>>>> Looks good to me.
>>>>>>>>>
>>>>>>>>> But I would suggest considering one more use case: if the user
>>>>>>>>> knows their data, they could split the loading manually. For
>>>>>>>>> example, the Person table contains 10M rows. The user could
>>>>>>>>> provide something like:
>>>>>>>>>
>>>>>>>>>     cache.loadCache(null,
>>>>>>>>>         "Person", "select * from Person where id < 1_000_000",
>>>>>>>>>         "Person", "select * from Person where id >= 1_000_000 and id < 2_000_000",
>>>>>>>>>         ....
>>>>>>>>>         "Person", "select * from Person where id >= 9_000_000 and id < 10_000_000");
>>>>>>>>>
>>>>>>>>> or maybe it could be some descriptor object like
>>>>>>>>>
>>>>>>>>>     {
>>>>>>>>>         sql: "select * from Person where id >= ? and id < ?"
>>>>>>>>>         range: 0...10_000_000
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>> In this case the provided queries will be sent to as many nodes as
>>>>>>>>> there are queries, and the data will be loaded in parallel; for
>>>>>>>>> keys that are not local, a data streamer should be used (as in
>>>>>>>>> Alexandr's description).
>>>>>>>>>
>>>>>>>>> I think it is a good issue for Ignite 2.0.
>>>>>>>>>
>>>>>>>>> Vova, Val, what do you think?
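For reference, the range-split call above maps fairly naturally onto the existing (type name, SQL) argument-pair convention of the JDBC POJO store. A sketch, with the cache name, type name and ranges being illustrative:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;

    public class RangeSplitLoad {
        /** Splits the initial load of Person into id ranges, one SQL query per range. */
        public static void load(Ignite ignite) {
            IgniteCache<Long, Object> cache = ignite.cache("personCache");

            // With CacheJdbcPojoStore, loadCache arguments are interpreted
            // as pairs of {key type name, SQL query}.
            cache.loadCache(null,
                "org.example.Person", "select * from Person where id < 1000000",
                "org.example.Person", "select * from Person where id >= 1000000 and id < 2000000",
                "org.example.Person", "select * from Person where id >= 9000000 and id < 10000000");
        }
    }

Note that today each node still runs every query and keeps only its own rows, which is exactly the overhead discussed in this thread.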
>>>>>>>>> On Mon, Nov 14, 2016 at 4:01 PM, Alexandr Kuramshin <ein.nsk...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> All right,
>>>>>>>>>>
>>>>>>>>>> Let's assume a simple scenario. When IgniteCache.loadCache is
>>>>>>>>>> invoked, we check whether the cache is not local, and if so, we
>>>>>>>>>> initiate the new loading logic.
>>>>>>>>>>
>>>>>>>>>> First, we pick a "streamer" node. This could be done by utilizing
>>>>>>>>>> LoadBalancingSpi, or it may be configured statically, for
>>>>>>>>>> instance because the streamer node runs on the same host as the
>>>>>>>>>> persistence storage provider.
>>>>>>>>>>
>>>>>>>>>> After that we start the loading task on the streamer node, which
>>>>>>>>>> creates an IgniteDataStreamer and loads the cache with
>>>>>>>>>> CacheStore.loadCache. Every call to IgniteBiInClosure.apply
>>>>>>>>>> simply invokes IgniteDataStreamer.addData.
>>>>>>>>>>
>>>>>>>>>> This implementation completely removes the extra load on the
>>>>>>>>>> persistence storage provider. Network overhead is also decreased
>>>>>>>>>> in the case of partitioned caches. For two nodes we transfer 1.5
>>>>>>>>>> times the data set over the network (the whole set is transferred
>>>>>>>>>> from the persistence storage to the streamer, and then half of it
>>>>>>>>>> from the streamer node to the other node). For three nodes it is
>>>>>>>>>> 1 2/3 times, and so on, approaching twice the data set on big
>>>>>>>>>> clusters.
>>>>>>>>>>
>>>>>>>>>> I'd like to propose an additional optimization at this point. If
>>>>>>>>>> the streamer node is on the same machine as the persistence
>>>>>>>>>> storage provider, then the transfer from the storage to the
>>>>>>>>>> streamer does not cross the network at all. It could be a special
>>>>>>>>>> daemon node for cache loading assigned in the cache
>>>>>>>>>> configuration, or an ordinary server node.
>>>>>>>>>>
>>>>>>>>>> Certainly, these calculations assume an evenly partitioned cache
>>>>>>>>>> with only primary copies (no backups). In the case of one backup
>>>>>>>>>> (the most frequent case, I think), we transfer 2 times the data
>>>>>>>>>> set over the network on two nodes, 2 1/3 on three, 2 1/2 on four,
>>>>>>>>>> and so on, up to three times the data set on big clusters. Hence
>>>>>>>>>> it's still better than the current implementation. In the worst
>>>>>>>>>> case, a fully replicated cache, we transfer N+1 times the data
>>>>>>>>>> set over the network (where N is the number of nodes in the
>>>>>>>>>> cluster). But that's not a problem in small clusters and only a
>>>>>>>>>> modest overhead in big clusters, and we still gain the
>>>>>>>>>> persistence storage provider optimization.
>>>>>>>>>>
>>>>>>>>>> Now let's take a more complex scenario. To achieve some level of
>>>>>>>>>> parallelism, we could split our cluster into several groups. This
>>>>>>>>>> could be a parameter of the IgniteCache.loadCache method or a
>>>>>>>>>> cache configuration option. The number of groups could be a fixed
>>>>>>>>>> value, or it could be calculated dynamically from the maximum
>>>>>>>>>> number of nodes per group.
>>>>>>>>>>
>>>>>>>>>> After splitting the whole cluster into groups, we pick a streamer
>>>>>>>>>> node in each group and submit a loading task similar to the
>>>>>>>>>> single-streamer scenario, except that only the keys that belong
>>>>>>>>>> to the streamer node's cluster group are passed to
>>>>>>>>>> IgniteDataStreamer.addData.
>>>>>>>>>>
>>>>>>>>>> In this case the overhead grows with the level of parallelism
>>>>>>>>>> rather than with the total number of nodes in the cluster.
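A minimal sketch of the single-streamer part of this scenario, under the assumption that the store can be constructed on the chosen node; StreamerLoader and loadVia are illustrative names, not proposed API:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteDataStreamer;
    import org.apache.ignite.cache.store.CacheStore;

    public class StreamerLoader {
        /** Runs on the chosen "streamer" node: the store reads the data set
         *  once, and the streamer routes each entry to its primary and
         *  backup nodes. */
        public static <K, V> void loadVia(Ignite ignite, String cacheName,
            CacheStore<K, V> store, Object... args) {
            try (IgniteDataStreamer<K, V> streamer = ignite.dataStreamer(cacheName)) {
                // Every entry the store produces goes straight to the streamer.
                store.loadCache(streamer::addData, args);
            } // close() flushes the remaining buffered entries.
        }
    }

This is where the 1.5-times figure for two nodes comes from: the full data set crosses the network once into the streamer node, and roughly half of it is then forwarded to the other node.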
>>>>>>>>>> 2016-11-11 15:37 GMT+03:00 Alexey Kuznetsov <akuznet...@apache.org>:
>>>>>>>>>>
>>>>>>>>>>> Alexandr,
>>>>>>>>>>>
>>>>>>>>>>> Could you describe your proposal in more detail? Especially the
>>>>>>>>>>> case with several nodes.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 11, 2016 at 6:34 PM, Alexandr Kuramshin <ein.nsk...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> You know the CacheStore API that is commonly used for the
>>>>>>>>>>>> read/write-through relationship between the in-memory data and
>>>>>>>>>>>> the persistence storage.
>>>>>>>>>>>>
>>>>>>>>>>>> There is also the IgniteCache.loadCache method for hot-loading
>>>>>>>>>>>> the cache on startup. Invoking this method causes execution of
>>>>>>>>>>>> CacheStore.loadCache on all the nodes storing the cache's
>>>>>>>>>>>> partitions. Because no keys are passed to the
>>>>>>>>>>>> CacheStore.loadCache methods, the underlying implementation is
>>>>>>>>>>>> forced to read all the data from the persistence storage, even
>>>>>>>>>>>> though only part of the data will be stored on each node.
>>>>>>>>>>>>
>>>>>>>>>>>> So, the current implementation has two general drawbacks:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. The persistence storage is forced to perform as many
>>>>>>>>>>>> identical queries as there are nodes in the cluster. Each query
>>>>>>>>>>>> may involve a lot of additional computation on the persistence
>>>>>>>>>>>> storage server.
>>>>>>>>>>>>
>>>>>>>>>>>> 2. The network is forced to transfer much more data than
>>>>>>>>>>>> necessary, which is obviously a big disadvantage on large
>>>>>>>>>>>> systems.
>>>>>>>>>>>>
>>>>>>>>>>>> The partition-aware data loading approach, described in
>>>>>>>>>>>> https://apacheignite.readme.io/docs/data-loading#section-partition-aware-data-loading,
>>>>>>>>>>>> is not an option: it requires persisting volatile data that
>>>>>>>>>>>> depends on the affinity function implementation and settings.
>>>>>>>>>>>>
>>>>>>>>>>>> I propose using something like IgniteDataStreamer inside the
>>>>>>>>>>>> IgniteCache.loadCache implementation.
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Alexandr Kuramshin
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Alexey Kuznetsov
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Thanks,
>>>>>>>>>> Alexandr Kuramshin
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Alexey Kuznetsov
>>>>>>>>> GridGain Systems
>>>>>>>>> www.gridgain.com
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> Alexandr Kuramshin
>>>
>>> --
>>> Thanks,
>>> Alexandr Kuramshin
>>
>> --
>> Alexey Kuznetsov
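For context on the partition-aware approach rejected in Alexandr's original message above: each node would load only its own partitions, which works only as long as a partition column persisted in the database stays in sync with the affinity function. A sketch, where the partId column is exactly the volatile data in question:

    import java.util.Arrays;
    import org.apache.ignite.Ignite;

    public class PartitionAwareLoad {
        /** Builds the per-node query for partition-aware loading. The persisted
         *  partId column must match the cache's current affinity function. */
        public static String localLoadQuery(Ignite ignite, String cacheName) {
            int[] parts = ignite.affinity(cacheName)
                .primaryPartitions(ignite.cluster().localNode());

            // Produces e.g. "select * from Person where partId in (0, 5, 12)".
            return "select * from Person where partId in ("
                + Arrays.toString(parts).replaceAll("[\\[\\]]", "") + ")";
        }
    }

If the affinity function or its settings ever change, the persisted partId values silently go stale, which is the fragility Alexandr points out.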