Hi Vladimir,

I'm not proposing any changes in the API. The usage scenario is the same as described in
https://apacheignite.readme.io/docs/persistent-store#section-loadcache-
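For instance, assuming a cache named "personCache" with Long keys and Person values (the names are made up for illustration), a call might look like:

    // The extra argument is passed through to CacheStore.loadCache(...) as-is
    // and interpreted by the store implementation, e.g. as a WHERE clause.
    // "personCache" and Person are examples only.
    IgniteCache<Long, Person> cache = ignite.cache("personCache");

    cache.loadCache(null, "id < 1000000");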
The preload logic invokes IgniteCache.loadCache() with some additional arguments, depending on the CacheStore implementation, and then the loading occurs in the way I've already described.

2016-11-15 11:26 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:

> Hi Alex,
>
> >>> Let's give the user reusable code which is convenient, reliable and fast.
> Convenience - this is why I asked for an example of how the API can look and how users are going to use it.
>
> Vladimir.
>
> On Tue, Nov 15, 2016 at 11:18 AM, Alexandr Kuramshin <ein.nsk...@gmail.com> wrote:
>
> > Hi all,
> >
> > I think the discussion is going in a wrong direction. Certainly it's not a big deal to implement some custom user logic to load the data into caches. But the Ignite framework gives the user reusable code built on top of the basic system.
> >
> > So the main question is: why should developers let the user use a convenient way to load caches that is backed by a totally non-optimal solution?
> >
> > We could talk at length about different persistence storage types, but whenever we initiate the loading with IgniteCache.loadCache, the current implementation imposes much overhead on the network.
> >
> > Partition-aware data loading may be used in some scenarios to avoid this network overhead, but users are compelled to take additional steps to achieve this optimization: adding a column to tables, adding compound indices that include the added column, writing a piece of repeatable code to load the data into different caches in a fault-tolerant fashion, etc.
> >
> > Let's give the user reusable code which is convenient, reliable and fast.
> >
> > 2016-11-14 20:56 GMT+03:00 Valentin Kulichenko <valentin.kuliche...@gmail.com>:
> >
> > > Hi Aleksandr,
> > >
> > > The data streamer is already outlined as one of the possible approaches for loading the data [1]. Basically, you start a designated client node or choose a leader among server nodes [2] and then use the IgniteDataStreamer API to load the data. With this approach there is no need to have a CacheStore implementation at all. Can you please elaborate on what additional value you are trying to add here?
> > >
> > > [1] https://apacheignite.readme.io/docs/data-loading#ignitedatastreamer
> > > [2] https://apacheignite.readme.io/docs/leader-election
> > >
> > > -Val
> > >
> > > On Mon, Nov 14, 2016 at 8:23 AM, Dmitriy Setrakyan <dsetrak...@apache.org> wrote:
> > >
> > > > Hi,
> > > >
> > > > I just want to clarify a couple of API details from the original email to make sure that we are making the right assumptions here.
> > > >
> > > > *"Because no keys are passed to the CacheStore.loadCache method, the underlying implementation is forced to read all the data from the persistence storage"*
> > > >
> > > > According to the javadoc, the loadCache(...) method receives optional arguments from the user. You can pass anything you like, including a list of keys, an SQL where clause, etc.
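To illustrate that point, a store could interpret such an argument roughly like this (a hedged sketch, not code from Ignite; Person, the PERSON table and the dataSource field are assumptions, imports and batching omitted):

    // Honors an optional WHERE clause passed via
    // IgniteCache.loadCache(null, "id < 1000000").
    @Override public void loadCache(IgniteBiInClosure<Long, Person> clo, Object... args) {
        String where = (args != null && args.length > 0) ? (String)args[0] : null;
        String sql = "select id, name from PERSON" + (where == null ? "" : " where " + where);

        try (Connection conn = dataSource.getConnection();
             PreparedStatement stmt = conn.prepareStatement(sql);
             ResultSet rs = stmt.executeQuery()) {
            while (rs.next())
                clo.apply(rs.getLong(1), new Person(rs.getLong(1), rs.getString(2)));
        }
        catch (SQLException e) {
            throw new CacheLoaderException(e);
        }
    }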
> > > > *"The partition-aware data loading approach is not an option. It requires persisting volatile data that depends on the affinity function implementation and settings."*
> > > >
> > > > This is only partially true. While Ignite allows plugging in custom affinity functions, the affinity function is not something that changes dynamically, and it should always return the same partition for the same key. So the partition assignments are not volatile at all. If, in some very rare case, the partition assignment logic needs to change, then you could also update the partition assignments that you may have persisted elsewhere, e.g. in a database.
> > > >
> > > > D.
> > > >
> > > > On Mon, Nov 14, 2016 at 10:23 AM, Vladimir Ozerov <voze...@gridgain.com> wrote:
> > > >
> > > > > Alexandr, Alexey,
> > > > >
> > > > > While I agree with you that the current cache loading logic is far from ideal, it would be cool to see API drafts based on your suggestions to get a better understanding of your ideas. How exactly are users going to use your suggestions?
> > > > >
> > > > > My main concern is that initial load is not a trivial task in the general case. Some users have centralized RDBMS systems, some have NoSQL, others work with distributed persistent stores (e.g. HDFS). Sometimes we have Ignite nodes "near" the persistent data, sometimes we don't. Sharding, affinity, co-location, etc. If we try to support all (or many) cases out of the box, we may end up with a very messy and difficult API. So we should carefully balance between simplicity, usability and feature-rich characteristics here.
> > > > >
> > > > > Personally, I think that if a user is not satisfied with the "loadCache()" API, he just writes a simple closure with blackjack, streamer and queries, and sends it to whatever node he finds convenient. Not a big deal. Only very common cases should be added to the Ignite API.
> > > > >
> > > > > Vladimir.
> > > > >
> > > > > On Mon, Nov 14, 2016 at 12:43 PM, Alexey Kuznetsov <akuznet...@gridgain.com> wrote:
> > > > >
> > > > > > Looks good to me.
> > > > > >
> > > > > > But I would suggest considering one more use case:
> > > > > >
> > > > > > If the user knows their data, they could manually split the loading.
> > > > > > For example: the Persons table contains 10M rows.
> > > > > > The user could provide something like:
> > > > > >
> > > > > > cache.loadCache(null,
> > > > > >     "Person", "select * from Person where id < 1_000_000",
> > > > > >     "Person", "select * from Person where id >= 1_000_000 and id < 2_000_000",
> > > > > >     ....
> > > > > >     "Person", "select * from Person where id >= 9_000_000 and id < 10_000_000",
> > > > > > );
> > > > > >
> > > > > > or maybe it could be some descriptor object like:
> > > > > >
> > > > > > {
> > > > > >     sql: "select * from Person where id >= ? and id < ?"
> > > > > >     range: 0...10_000_000
> > > > > > }
> > > > > >
> > > > > > In this case the provided queries will be sent to as many nodes as there are queries. The data will be loaded in parallel, and for keys that are not local a data streamer should be used (as described in Alexandr's description).
> > > > > >
> > > > > > I think it is a good issue for Ignite 2.0.
> > > > > >
> > > > > > Vova, Val - what do you think?
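To make the quoted descriptor idea concrete, it might look like the following (purely hypothetical; neither RangeLoadDescriptor nor such a loadCache overload exists today):

    // Hypothetical descriptor: the whole key range plus a batch size,
    // from which the per-node sub-queries above could be generated.
    class RangeLoadDescriptor implements Serializable {
        final String sql;   // e.g. "select * from Person where id >= ? and id < ?"
        final long from;    // inclusive lower bound of the whole range
        final long to;      // exclusive upper bound of the whole range
        final long batch;   // rows per sub-query, e.g. 1_000_000

        RangeLoadDescriptor(String sql, long from, long to, long batch) {
            this.sql = sql;
            this.from = from;
            this.to = to;
            this.batch = batch;
        }
    }

    // Hypothetical usage:
    cache.loadCache(null,
        new RangeLoadDescriptor("select * from Person where id >= ? and id < ?",
            0, 10_000_000, 1_000_000));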
> > > > > >
> > > > > > On Mon, Nov 14, 2016 at 4:01 PM, Alexandr Kuramshin <ein.nsk...@gmail.com> wrote:
> > > > > >
> > > > > > > All right,
> > > > > > >
> > > > > > > Let's assume a simple scenario. When IgniteCache.loadCache is invoked, we check whether the cache is not local, and if so, we initiate the new loading logic.
> > > > > > >
> > > > > > > First, we take a "streamer" node. It could be chosen by utilizing LoadBalancingSpi, or it may be configured statically, for the reason that the streamer node is running on the same host as the persistence storage provider.
> > > > > > >
> > > > > > > After that we start the loading task on the streamer node, which creates an IgniteDataStreamer and loads the cache with CacheStore.loadCache. Every call to IgniteBiInClosure.apply simply invokes IgniteDataStreamer.addData.
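A rough sketch of that loading task, assuming an existing Ignite instance, a "personCache" cache and a PersonStore class (all illustrative; node selection and fault tolerance are simplified):

    // Pick one "streamer" node and run the store scan there exactly once;
    // the data streamer routes each entry to its owning node.
    ClusterGroup streamerNode = ignite.cluster().forOldest(); // or via LoadBalancingSpi

    ignite.compute(streamerNode).run(new IgniteRunnable() {
        @IgniteInstanceResource
        private Ignite ignite;

        @Override public void run() {
            try (IgniteDataStreamer<Long, Person> streamer = ignite.dataStreamer("personCache")) {
                new PersonStore().loadCache(new IgniteBiInClosure<Long, Person>() {
                    @Override public void apply(Long key, Person val) {
                        streamer.addData(key, val); // sent to the entry's owner
                    }
                });
            }
        }
    });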
> > > > > > > This implementation will completely relieve the overhead on the persistence storage provider. Network overhead is also decreased in the case of partitioned caches. For two nodes we get 1.5 times the amount of data transferred over the network (1 part will be transferred from the persistence storage to the streamer, and then 1/2 from the streamer node to the other node). For three nodes it will be 1 + 2/3, and so on, up to two times the amount of data on big clusters.
> > > > > > >
> > > > > > > I'd like to propose an additional optimization at this point. If we have the streamer node on the same machine as the persistence storage provider, then we completely relieve the network overhead as well. It could be some special daemon node for cache loading assigned in the cache configuration, or an ordinary server node.
> > > > > > >
> > > > > > > Certainly, these calculations assume an evenly partitioned cache with only primary copies (without backups). In the case of one backup (the most frequent case, I think), we get 2 times the amount of data transferred over the network on two nodes, 2 + 1/3 on three, 2 + 1/2 on four, and so on, up to three times the amount of data on big clusters. Hence it's still better than the current implementation. In the worst case, with a fully replicated cache, we transfer N+1 times the amount of data over the network (where N is the number of nodes in the cluster). But that's not a problem in small clusters, and only a little overhead in big clusters. And we still gain the persistence storage provider optimization.
> > > > > > >
> > > > > > > Now let's take a more complex scenario. To achieve some level of parallelism, we could split our cluster into several groups. It could be a parameter of the IgniteCache.loadCache method or a cache configuration option. The number of groups could be a fixed value, or it could be calculated dynamically from the maximum number of nodes in a group.
> > > > > > >
> > > > > > > After splitting the whole cluster into groups, we take a streamer node in each group and submit the cache loading task similar to the single-streamer scenario, except that only the keys that correspond to the cluster group where the streamer node is running are passed to the IgniteDataStreamer.addData method.
> > > > > > >
> > > > > > > In this case the overhead grows with the level of parallelism, rather than with the total number of nodes in the cluster.
> > > > > > >
> > > > > > > 2016-11-11 15:37 GMT+03:00 Alexey Kuznetsov <akuznet...@apache.org>:
> > > > > > >
> > > > > > > > Alexandr,
> > > > > > > >
> > > > > > > > Could you describe your proposal in more detail?
> > > > > > > > Especially in the case with several nodes.
> > > > > > > >
> > > > > > > > On Fri, Nov 11, 2016 at 6:34 PM, Alexandr Kuramshin <ein.nsk...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > You know the CacheStore API that is commonly used for the read/write-through relationship between the in-memory data and the persistence storage.
> > > > > > > > >
> > > > > > > > > There is also the IgniteCache.loadCache method for hot-loading the cache on startup. Invocation of this method causes execution of CacheStore.loadCache on all nodes storing the cache's partitions. Because no keys are passed to the CacheStore.loadCache method, the underlying implementation is forced to read all the data from the persistence storage, but only part of the data will be stored on each node.
> > > > > > > > >
> > > > > > > > > So, the current implementation has two general drawbacks:
> > > > > > > > >
> > > > > > > > > 1. The persistence storage is forced to perform as many identical queries as there are nodes in the cluster. Each query may involve much additional computation on the persistence storage server.
> > > > > > > > >
> > > > > > > > > 2. The network is forced to transfer much more data than necessary, which is obviously a big disadvantage on large systems.
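To make these drawbacks concrete, here is roughly what every node effectively does today (a simplified illustration, not the actual internal code; store, cache and "personCache" are stand-ins):

    // Runs on EACH of the N nodes: the store scans the whole data set,
    // and entries that don't belong locally are simply discarded after
    // having already crossed the network.
    Affinity<Long> aff = ignite.affinity("personCache");
    ClusterNode locNode = ignite.cluster().localNode();

    store.loadCache(new IgniteBiInClosure<Long, Person>() {
        @Override public void apply(Long key, Person val) {
            if (aff.isPrimaryOrBackup(locNode, key))
                cache.put(key, val); // this entry belongs here, keep it
            // else: the row was queried and transferred for nothing
        }
    });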
> > > > > > > > > The partition-aware data loading approach, described in
> > > > > > > > > https://apacheignite.readme.io/docs/data-loading#section-partition-aware-data-loading
> > > > > > > > > , is not an option. It requires persisting volatile data that depends on the affinity function implementation and settings.
> > > > > > > > >
> > > > > > > > > I propose using something like IgniteDataStreamer inside the IgniteCache.loadCache implementation.
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Thanks,
> > > > > > > > > Alexandr Kuramshin
> > > > > > > >
> > > > > > > > --
> > > > > > > > Alexey Kuznetsov
> > > > > > >
> > > > > > > --
> > > > > > > Thanks,
> > > > > > > Alexandr Kuramshin
> > > > > >
> > > > > > --
> > > > > > Alexey Kuznetsov
> > > > > > GridGain Systems
> > > > > > www.gridgain.com
> >
> > --
> > Thanks,
> > Alexandr Kuramshin

--
Thanks,
Alexandr Kuramshin