How would your proposal resolve the main point Aleksandr is trying to convey, namely the extensive network utilization?
As I see it, the loadCache method will still be triggered on every node and, as before, all the nodes will pre-load the whole data set from the database. That was Aleksandr's reasonable concern. If we come up with a way to call loadCache on a specific node only and implement some fault-tolerant mechanism, then your suggestion should work perfectly fine.

Denis

> On Nov 15, 2016, at 12:05 PM, Valentin Kulichenko <valentin.kuliche...@gmail.com> wrote:
>
> It sounds like Aleksandr is basically proposing to support automatic
> persistence [1] for loading through the data streamer, and we really don't
> have this. However, I think I have a more generic solution in mind.
>
> What if we add one more IgniteCache.loadCache overload like this:
>
>     loadCache(@Nullable IgniteBiPredicate<K, V> p,
>         IgniteBiInClosure<K, V> clo,
>         @Nullable Object... args)
>
> It's the same as the existing one, but with a key-value closure provided
> as a parameter. This closure will be passed to CacheStore.loadCache along
> with the arguments and will allow overriding the logic that actually saves
> the loaded entry in the cache (currently this logic is always provided by
> the cache itself and the user can't control it).
>
> We can then provide an implementation of this closure that will create a
> data streamer and call addData() within its apply() method.
>
> I see the following advantages:
>
> - Any existing CacheStore implementation can be reused to load through
>   the streamer (our JDBC and Cassandra stores, or anything else the user has).
> - The loading code is always part of the CacheStore implementation, so
>   it's very easy to switch between different ways of loading.
> - The user is not limited to the two approaches we provide out of the box;
>   they can always implement a new one.
>
> Thoughts?
>
> [1] https://apacheignite.readme.io/docs/automatic-persistence
>
> -Val
>
> On Tue, Nov 15, 2016 at 2:27 AM, Alexey Kuznetsov <akuznet...@apache.org> wrote:
>
>> Hi, All!
>>
>> I think we do not need to change the API at all:
>>
>>     public void loadCache(@Nullable IgniteBiPredicate<K, V> p,
>>         @Nullable Object... args) throws CacheException;
>>
>> We can pass any args to loadCache(). So we could create a class
>>
>>     IgniteCacheLoadDescriptor {
>>         // some fields that describe how to load
>>     }
>>
>> and modify the POJO store to detect and use such arguments.
>>
>> All we need is to implement this and write good documentation and examples.
>>
>> Thoughts?
>>
>> On Tue, Nov 15, 2016 at 5:22 PM, Alexandr Kuramshin <ein.nsk...@gmail.com> wrote:
>>
>>> Hi Vladimir,
>>>
>>> I don't propose any changes to the API. The usage scenario is the same
>>> as described in
>>> https://apacheignite.readme.io/docs/persistent-store#section-loadcache-
>>>
>>> The cache preloading logic invokes IgniteCache.loadCache() with some
>>> additional arguments, depending on the CacheStore implementation, and
>>> then the loading occurs in the way I've already described.
>>>
>>> 2016-11-15 11:26 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:
>>>
>>>> Hi Alex,
>>>>
>>>>> Let's give the user the reusable code which is convenient, reliable
>>>>> and fast.
>>>>
>>>> Convenience: this is why I asked for an example of how the API could
>>>> look and how users are going to use it.
>>>>
>>>> Vladimir.
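To make the overload proposed by Val above concrete, here is a minimal sketch of a streamer-backed closure. This is illustrative only: the overload does not exist yet, and StreamerLoadClosure is a hypothetical name.

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteDataStreamer;
    import org.apache.ignite.lang.IgniteBiInClosure;

    /** Hypothetical closure that routes every loaded entry through a data streamer. */
    public class StreamerLoadClosure<K, V> implements IgniteBiInClosure<K, V>, AutoCloseable {
        private final IgniteDataStreamer<K, V> streamer;

        public StreamerLoadClosure(Ignite ignite, String cacheName) {
            // In a real implementation the streamer would have to be created
            // on the node that actually executes CacheStore.loadCache.
            streamer = ignite.dataStreamer(cacheName);
        }

        /** Invoked by CacheStore.loadCache for each entry read from the storage. */
        @Override public void apply(K key, V val) {
            streamer.addData(key, val);
        }

        /** Flushes the remaining buffered entries once loading completes. */
        @Override public void close() {
            streamer.close();
        }
    }

With such a closure, the proposed call would look like cache.loadCache(null, new StreamerLoadClosure<>(ignite, "myCache")), and any existing CacheStore would load through the streamer unchanged.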
>>>> On Tue, Nov 15, 2016 at 11:18 AM, Alexandr Kuramshin <ein.nsk...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I think the discussion is going in the wrong direction. Certainly it's
>>>>> not a big deal to implement some custom user logic to load the data
>>>>> into caches. But the Ignite framework gives the user reusable code
>>>>> built on top of the basic system.
>>>>>
>>>>> So the main question is: why do the developers let the user use a
>>>>> convenient way to load caches that is a totally non-optimal solution?
>>>>>
>>>>> We could talk at length about different persistence storage types, but
>>>>> whenever we initiate the loading with IgniteCache.loadCache, the
>>>>> current implementation imposes a lot of overhead on the network.
>>>>>
>>>>> Partition-aware data loading may be used in some scenarios to avoid
>>>>> this network overhead, but users are compelled to take additional
>>>>> steps to achieve this optimization: adding a column to the tables,
>>>>> adding compound indices that include the added column, writing a piece
>>>>> of repetitive code to load the data into different caches in a
>>>>> fault-tolerant fashion, etc.
>>>>>
>>>>> Let's give the user reusable code which is convenient, reliable and fast.
>>>>>
>>>>> 2016-11-14 20:56 GMT+03:00 Valentin Kulichenko <valentin.kuliche...@gmail.com>:
>>>>>
>>>>>> Hi Aleksandr,
>>>>>>
>>>>>> The data streamer is already outlined as one of the possible
>>>>>> approaches for loading the data [1]. Basically, you start a
>>>>>> designated client node, or choose a leader among the server nodes
>>>>>> [2], and then use the IgniteDataStreamer API to load the data. With
>>>>>> this approach there is no need to have a CacheStore implementation at
>>>>>> all. Can you please elaborate on what additional value you are trying
>>>>>> to add here?
>>>>>>
>>>>>> [1] https://apacheignite.readme.io/docs/data-loading#ignitedatastreamer
>>>>>> [2] https://apacheignite.readme.io/docs/leader-election
>>>>>>
>>>>>> -Val
>>>>>>
>>>>>> On Mon, Nov 14, 2016 at 8:23 AM, Dmitriy Setrakyan <dsetrak...@apache.org> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I just want to clarify a couple of API details from the original
>>>>>>> email to make sure that we are making the right assumptions here.
>>>>>>>
>>>>>>> *"Because no keys are passed to the CacheStore.loadCache methods,
>>>>>>> the underlying implementation is forced to read all the data from
>>>>>>> the persistence storage"*
>>>>>>>
>>>>>>> According to the javadoc, the loadCache(...) method receives
>>>>>>> optional arguments from the user. You can pass anything you like,
>>>>>>> including a list of keys, an SQL where clause, etc.
>>>>>>>
>>>>>>> *"The partition-aware data loading approach is not an option. It
>>>>>>> requires persisting volatile data that depends on the affinity
>>>>>>> function implementation and settings."*
>>>>>>>
>>>>>>> This is only partially true. While Ignite allows plugging in custom
>>>>>>> affinity functions, the affinity function is not something that
>>>>>>> changes dynamically, and it should always return the same partition
>>>>>>> for the same key. So the partition assignments are not volatile at
>>>>>>> all. If, in some very rare case, the partition assignment logic
>>>>>>> needs to change, then you could also update the partition
>>>>>>> assignments that you may have persisted elsewhere, e.g. in the
>>>>>>> database.
>>>>>>>
>>>>>>> D.
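A small sketch of Dmitriy's point about optional arguments: a WHERE clause can travel from the loadCache call site down to the store. The store below is a hand-written illustration, not a shipped implementation, and Person is an assumed user class.

    import javax.cache.Cache;
    import org.apache.ignite.cache.store.CacheStoreAdapter;
    import org.apache.ignite.lang.IgniteBiInClosure;

    /** Illustrative store that treats the first loadCache argument as a WHERE clause. */
    public class WhereClauseStore extends CacheStoreAdapter<Long, Person> {
        @Override public void loadCache(IgniteBiInClosure<Long, Person> clo, Object... args) {
            String where = args != null && args.length > 0 ? (String)args[0] : "1 = 1";

            // Run "select id, name from Person where " + where against the
            // database and hand every row to the closure:
            // clo.apply(id, new Person(id, name));
        }

        // Read/write-through methods are not relevant to this example.
        @Override public Person load(Long key) { return null; }
        @Override public void write(Cache.Entry<? extends Long, ? extends Person> e) { /* no-op */ }
        @Override public void delete(Object key) { /* no-op */ }
    }

    /** Assumed user value class. */
    class Person {
        final long id;
        final String name;
        Person(long id, String name) { this.id = id; this.name = name; }
    }

The call site would then be, for instance, cache.loadCache(null, "id < 1000000").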
>>>>>>> On Mon, Nov 14, 2016 at 10:23 AM, Vladimir Ozerov <voze...@gridgain.com> wrote:
>>>>>>>
>>>>>>>> Alexandr, Alexey,
>>>>>>>>
>>>>>>>> While I agree with you that the current cache loading logic is far
>>>>>>>> from ideal, it would be cool to see API drafts based on your
>>>>>>>> suggestions to get a better understanding of your ideas. How
>>>>>>>> exactly are users going to use them?
>>>>>>>>
>>>>>>>> My main concern is that initial load is not a trivial task in the
>>>>>>>> general case. Some users have centralized RDBMS systems, some have
>>>>>>>> NoSQL, others work with distributed persistent stores (e.g. HDFS).
>>>>>>>> Sometimes we have Ignite nodes "near" the persistent data,
>>>>>>>> sometimes we don't. Sharding, affinity, co-location, etc. If we try
>>>>>>>> to support all (or many) cases out of the box, we may end up with a
>>>>>>>> very messy and difficult API. So we should carefully balance
>>>>>>>> simplicity, usability and richness of features here.
>>>>>>>>
>>>>>>>> Personally, I think that if a user is not satisfied with the
>>>>>>>> "loadCache()" API, he just writes a simple closure with a streamer
>>>>>>>> and queries and sends it to whatever node he finds convenient. Not
>>>>>>>> a big deal. Only very common cases should be added to the Ignite API.
>>>>>>>>
>>>>>>>> Vladimir.
>>>>>>>>
>>>>>>>> On Mon, Nov 14, 2016 at 12:43 PM, Alexey Kuznetsov <akuznet...@gridgain.com> wrote:
>>>>>>>>
>>>>>>>>> Looks good to me.
>>>>>>>>>
>>>>>>>>> But I would suggest considering one more use case: if the user
>>>>>>>>> knows their data, they could split the loading manually. For
>>>>>>>>> example, the Person table contains 10M rows. The user could
>>>>>>>>> provide something like:
>>>>>>>>>
>>>>>>>>>     cache.loadCache(null,
>>>>>>>>>         "Person", "select * from Person where id < 1_000_000",
>>>>>>>>>         "Person", "select * from Person where id >= 1_000_000 and id < 2_000_000",
>>>>>>>>>         ....
>>>>>>>>>         "Person", "select * from Person where id >= 9_000_000 and id < 10_000_000");
>>>>>>>>>
>>>>>>>>> or maybe it could be some descriptor object like
>>>>>>>>>
>>>>>>>>>     {
>>>>>>>>>         sql: "select * from Person where id >= ? and id < ?"
>>>>>>>>>         range: 0...10_000_000
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>> In this case the provided queries will be sent to as many nodes as
>>>>>>>>> there are queries, and the data will be loaded in parallel; for
>>>>>>>>> keys that are not local, a data streamer should be used (as in
>>>>>>>>> Alexandr's description).
>>>>>>>>>
>>>>>>>>> I think it is a good issue for Ignite 2.0.
>>>>>>>>>
>>>>>>>>> Vova, Val, what do you think?
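For reference, the range-split call above maps fairly naturally onto the existing (type name, SQL) argument-pair convention of the JDBC POJO store. A sketch, with the cache name, type name and ranges being illustrative:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;

    public class RangeSplitLoad {
        /** Splits the initial load of Person into id ranges, one SQL query per range. */
        public static void load(Ignite ignite) {
            IgniteCache<Long, Object> cache = ignite.cache("personCache");

            // With CacheJdbcPojoStore, loadCache arguments are interpreted
            // as pairs of {key type name, SQL query}.
            cache.loadCache(null,
                "org.example.Person", "select * from Person where id < 1000000",
                "org.example.Person", "select * from Person where id >= 1000000 and id < 2000000",
                "org.example.Person", "select * from Person where id >= 9000000 and id < 10000000");
        }
    }

Note that today each node still runs every query and keeps only its own rows, which is exactly the overhead discussed in this thread.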
>>>>>>>>> On Mon, Nov 14, 2016 at 4:01 PM, Alexandr Kuramshin <ein.nsk...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> All right,
>>>>>>>>>>
>>>>>>>>>> Let's assume a simple scenario. When IgniteCache.loadCache is
>>>>>>>>>> invoked, we check whether the cache is not local, and if so, we
>>>>>>>>>> initiate the new loading logic.
>>>>>>>>>>
>>>>>>>>>> First, we pick a "streamer" node. This could be done by utilizing
>>>>>>>>>> LoadBalancingSpi, or it may be configured statically, for
>>>>>>>>>> instance because the streamer node runs on the same host as the
>>>>>>>>>> persistence storage provider.
>>>>>>>>>>
>>>>>>>>>> After that we start the loading task on the streamer node, which
>>>>>>>>>> creates an IgniteDataStreamer and loads the cache with
>>>>>>>>>> CacheStore.loadCache. Every call to IgniteBiInClosure.apply
>>>>>>>>>> simply invokes IgniteDataStreamer.addData.
>>>>>>>>>>
>>>>>>>>>> This implementation completely removes the extra load on the
>>>>>>>>>> persistence storage provider. Network overhead is also decreased
>>>>>>>>>> in the case of partitioned caches. For two nodes we transfer 1.5
>>>>>>>>>> times the data set over the network (the whole set is transferred
>>>>>>>>>> from the persistence storage to the streamer, and then half of it
>>>>>>>>>> from the streamer node to the other node). For three nodes it is
>>>>>>>>>> 1 2/3 times, and so on, approaching twice the data set on big
>>>>>>>>>> clusters.
>>>>>>>>>>
>>>>>>>>>> I'd like to propose an additional optimization at this point. If
>>>>>>>>>> the streamer node is on the same machine as the persistence
>>>>>>>>>> storage provider, then the transfer from the storage to the
>>>>>>>>>> streamer does not cross the network at all. It could be a special
>>>>>>>>>> daemon node for cache loading assigned in the cache
>>>>>>>>>> configuration, or an ordinary server node.
>>>>>>>>>>
>>>>>>>>>> Certainly, these calculations assume an evenly partitioned cache
>>>>>>>>>> with only primary copies (no backups). In the case of one backup
>>>>>>>>>> (the most frequent case, I think), we transfer 2 times the data
>>>>>>>>>> set over the network on two nodes, 2 1/3 on three, 2 1/2 on four,
>>>>>>>>>> and so on, up to three times the data set on big clusters. Hence
>>>>>>>>>> it's still better than the current implementation. In the worst
>>>>>>>>>> case, a fully replicated cache, we transfer N+1 times the data
>>>>>>>>>> set over the network (where N is the number of nodes in the
>>>>>>>>>> cluster). But that's not a problem in small clusters and only a
>>>>>>>>>> modest overhead in big clusters, and we still gain the
>>>>>>>>>> persistence storage provider optimization.
>>>>>>>>>>
>>>>>>>>>> Now let's take a more complex scenario. To achieve some level of
>>>>>>>>>> parallelism, we could split our cluster into several groups. This
>>>>>>>>>> could be a parameter of the IgniteCache.loadCache method or a
>>>>>>>>>> cache configuration option. The number of groups could be a fixed
>>>>>>>>>> value, or it could be calculated dynamically from the maximum
>>>>>>>>>> number of nodes per group.
>>>>>>>>>>
>>>>>>>>>> After splitting the whole cluster into groups, we pick a streamer
>>>>>>>>>> node in each group and submit a loading task similar to the
>>>>>>>>>> single-streamer scenario, except that only the keys that belong
>>>>>>>>>> to the streamer node's cluster group are passed to
>>>>>>>>>> IgniteDataStreamer.addData.
>>>>>>>>>>
>>>>>>>>>> In this case the overhead grows with the level of parallelism
>>>>>>>>>> rather than with the total number of nodes in the cluster.
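A minimal sketch of the single-streamer part of this scenario, under the assumption that the store can be constructed on the chosen node; StreamerLoader and loadVia are illustrative names, not proposed API:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteDataStreamer;
    import org.apache.ignite.cache.store.CacheStore;

    public class StreamerLoader {
        /** Runs on the chosen "streamer" node: the store reads the data set
         *  once, and the streamer routes each entry to its primary and
         *  backup nodes. */
        public static <K, V> void loadVia(Ignite ignite, String cacheName,
            CacheStore<K, V> store, Object... args) {
            try (IgniteDataStreamer<K, V> streamer = ignite.dataStreamer(cacheName)) {
                // Every entry the store produces goes straight to the streamer.
                store.loadCache(streamer::addData, args);
            } // close() flushes the remaining buffered entries.
        }
    }

This is where the 1.5-times figure for two nodes comes from: the full data set crosses the network once into the streamer node, and roughly half of it is then forwarded to the other node.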
>>>>>>>>>> 2016-11-11 15:37 GMT+03:00 Alexey Kuznetsov <akuznet...@apache.org>:
>>>>>>>>>>
>>>>>>>>>>> Alexandr,
>>>>>>>>>>>
>>>>>>>>>>> Could you describe your proposal in more detail? Especially the
>>>>>>>>>>> case with several nodes.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 11, 2016 at 6:34 PM, Alexandr Kuramshin <ein.nsk...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> You know the CacheStore API that is commonly used for the
>>>>>>>>>>>> read/write-through relationship between the in-memory data and
>>>>>>>>>>>> the persistence storage.
>>>>>>>>>>>>
>>>>>>>>>>>> There is also the IgniteCache.loadCache method for hot-loading
>>>>>>>>>>>> the cache on startup. Invoking this method causes execution of
>>>>>>>>>>>> CacheStore.loadCache on all the nodes storing the cache's
>>>>>>>>>>>> partitions. Because no keys are passed to the
>>>>>>>>>>>> CacheStore.loadCache methods, the underlying implementation is
>>>>>>>>>>>> forced to read all the data from the persistence storage, even
>>>>>>>>>>>> though only part of the data will be stored on each node.
>>>>>>>>>>>>
>>>>>>>>>>>> So, the current implementation has two general drawbacks:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. The persistence storage is forced to perform as many
>>>>>>>>>>>> identical queries as there are nodes in the cluster. Each query
>>>>>>>>>>>> may involve a lot of additional computation on the persistence
>>>>>>>>>>>> storage server.
>>>>>>>>>>>>
>>>>>>>>>>>> 2. The network is forced to transfer much more data than
>>>>>>>>>>>> necessary, which is obviously a big disadvantage on large
>>>>>>>>>>>> systems.
>>>>>>>>>>>>
>>>>>>>>>>>> The partition-aware data loading approach, described in
>>>>>>>>>>>> https://apacheignite.readme.io/docs/data-loading#section-partition-aware-data-loading,
>>>>>>>>>>>> is not an option: it requires persisting volatile data that
>>>>>>>>>>>> depends on the affinity function implementation and settings.
>>>>>>>>>>>>
>>>>>>>>>>>> I propose using something like IgniteDataStreamer inside the
>>>>>>>>>>>> IgniteCache.loadCache implementation.
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Alexandr Kuramshin
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Alexey Kuznetsov
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Thanks,
>>>>>>>>>> Alexandr Kuramshin
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Alexey Kuznetsov
>>>>>>>>> GridGain Systems
>>>>>>>>> www.gridgain.com
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> Alexandr Kuramshin
>>>
>>> --
>>> Thanks,
>>> Alexandr Kuramshin
>>
>> --
>> Alexey Kuznetsov
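For context on the partition-aware approach rejected in Alexandr's original message above: each node would load only its own partitions, which works only as long as a partition column persisted in the database stays in sync with the affinity function. A sketch, where the partId column is exactly the volatile data in question:

    import java.util.Arrays;
    import org.apache.ignite.Ignite;

    public class PartitionAwareLoad {
        /** Builds the per-node query for partition-aware loading. The persisted
         *  partId column must match the cache's current affinity function. */
        public static String localLoadQuery(Ignite ignite, String cacheName) {
            int[] parts = ignite.affinity(cacheName)
                .primaryPartitions(ignite.cluster().localNode());

            // Produces e.g. "select * from Person where partId in (0, 5, 12)".
            return "select * from Person where partId in ("
                + Arrays.toString(parts).replaceAll("[\\[\\]]", "") + ")";
        }
    }

If the affinity function or its settings ever change, the persisted partId values silently go stale, which is the fragility Alexandr points out.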