Re: IgniteCache.loadCache improvement proposal
Val, Yakov,

Sorry for the delay, I needed time to think and to run some tests. In any case, extending the API and supplying a default implementation is good: it makes the framework more flexible and usable. But your proposed extension will not solve the problem I have raised. Please read the following with special attention.

The current implementation of IgniteCache.loadCache causes parallel execution of IgniteCache.localLoadCache on each node in the cluster. It's a bad implementation, but it has the *right semantics*. You propose to extend IgniteCache.localLoadCache and use it to load data onto all the nodes. That has bad semantics, and it also leads to a bad implementation. Here is why: when you filter the data with the supplied IgniteBiPredicate, you may need to access data that must be co-located. So to load the data onto all the nodes, you need access to all the related data partitioned across the cluster. This leads to heavy network overhead and near-cache overload. That is also why I am surprised that the IgniteBiPredicate is executed for every key supplied by Cache.loadCache, and not only for those keys that will be stored on this node.

My conclusion: localLoadCache should first filter a key by the affinity function and the current cache topology, *then* invoke the predicate, and then store the entity in the cache (possibly by invoking the supplied closure). All associated partitions should be locked for the duration of the loading. IgniteCache.loadCache should perform Cache.loadCache on one (or a few) nodes, then transfer the entities to the remote nodes, and *then* invoke the predicate and closure on the remote nodes.

2016-11-22 2:16 GMT+03:00 Valentin Kulichenko: > Guys, > > I created a ticket for this: > https://issues.apache.org/jira/browse/IGNITE-4255 > > Feel free to provide comments. > > -Val > > On Sat, Nov 19, 2016 at 6:56 AM, Yakov Zhdanov > wrote: > > > > > > > > > > Why not store the partition ID in the database and query only local > > > partitions?
Whatever approach we design with a DataStreamer will be > > slower > > > than this. > > > > > > > Because this can be some generic DB. Imagine the app migrating to IMDG. > > > > I am pretty sure that in many cases approach with data streamer will be > > faster and in many cases approach with multiple queries will be faster. > And > > the choice should depend on many factors. I like Val's suggestions. I > think > > he goes in the right direction. > > > > --Yakov > > > -- Thanks, Alexandr Kuramshin
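[Editor's note] The ordering Alexandr proposes for localLoadCache (affinity filter first, then the user predicate, then the local store) can be sketched in isolation. Everything below is illustrative, not Ignite API: the affinity function is faked as a hash modulo, plain java.util.function types stand in for IgniteBiPredicate, and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.BiPredicate;

// Sketch of the proposed loading order: 1) affinity filter, 2) predicate,
// 3) local store. The predicate never runs on keys owned by other nodes.
public class AffinityFirstLoad {
    static final int PARTITIONS = 1024; // Ignite's default partition count

    // Stand-in for the affinity function: key -> partition.
    static int partition(Object key) {
        return Math.abs(key.hashCode() % PARTITIONS);
    }

    /** Entries this node accepts, given its set of locally-owned partitions. */
    static Map<String, String> load(Map<String, String> store,
                                    Set<Integer> localParts,
                                    BiPredicate<String, String> pred) {
        Map<String, String> cache = new HashMap<>();
        for (Map.Entry<String, String> e : store.entrySet()) {
            // 1. Affinity filter first: skip keys owned by other nodes.
            if (!localParts.contains(partition(e.getKey())))
                continue;
            // 2. User predicate runs only on locally-owned entries.
            if (pred.test(e.getKey(), e.getValue()))
                cache.put(e.getKey(), e.getValue()); // 3. Store locally.
        }
        return cache;
    }

    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();
        for (int i = 0; i < 100; i++) store.put("k" + i, "v" + i);

        Set<Integer> localParts = new java.util.HashSet<>();
        for (int p = 0; p < PARTITIONS; p += 2) localParts.add(p); // half the ring

        Map<String, String> cache = load(store, localParts, (k, v) -> true);
        System.out.println(cache.size() + " of " + store.size() + " entries kept");
    }
}
```

The point of the sketch is only the ordering: step 2 sees a strictly local subset, so filtering needs no cross-node data.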
Re: IgniteCache.loadCache improvement proposal
Guys, I created a ticket for this: https://issues.apache.org/jira/browse/IGNITE-4255 Feel free to provide comments. -Val On Sat, Nov 19, 2016 at 6:56 AM, Yakov Zhdanov wrote: > > > > > > Why not store the partition ID in the database and query only local > > partitions? Whatever approach we design with a DataStreamer will be > slower > > than this. > > > > Because this can be some generic DB. Imagine the app migrating to IMDG. > > I am pretty sure that in many cases approach with data streamer will be > faster and in many cases approach with multiple queries will be faster. And > the choice should depend on many factors. I like Val's suggestions. I think > he goes in the right direction. > > --Yakov >
Re: IgniteCache.loadCache improvement proposal
> > > Why not store the partition ID in the database and query only local > partitions? Whatever approach we design with a DataStreamer will be slower > than this. > Because this can be some generic DB. Imagine the app migrating to IMDG. I am pretty sure that in many cases approach with data streamer will be faster and in many cases approach with multiple queries will be faster. And the choice should depend on many factors. I like Val's suggestions. I think he goes in the right direction. --Yakov
Re: IgniteCache.loadCache improvement proposal
Alexandr, This has been tested many times already by our users and the answer is simple - it depends :) Any approach has its pros and cons, and you never know which one will be better for a particular use case, database, data model, hardware, etc. Having said that, you will never find the single best way to load the data, because it just doesn't exist. What I propose is simply to make the API more generic and give the user even more control than they have now. -Val On Fri, Nov 18, 2016 at 6:53 AM, Alexandr Kuramshin wrote: > Dmitriy, > > I am not at all confident that partition ID is the best approach in all > cases. Even if we have full access to the database structure, there are > other problems. > > Assume we have a table PERSON (ID NUMBER, NAME VARCHAR, SURNAME VARCHAR, > AGE NUMBER, EMPL_DATE DATE), and we add our own column PART NUMBER. > > While we already have indexes IDX1(NAME), IDX2(SURNAME), IDX3(AGE), > IDX4(EMPL_DATE), we have to add a new 2-column index IDX5(PART, EMPL_DATE) > for pre-loading at startup, for example, recently employed persons. > > And if we'd like to query filtered data from the database, we'd also have > to create the other compound indexes IDX6(PART, NAME), IDX7(PART, SURNAME), > IDX8(PART, AGE). So we roughly double the index overhead. > > After these modifications to the database are done and the PART column > is filled, what should we do to preload the data? > > We have to perform as many database queries as there are partitions stored > on the node. The number of queries would be 1024 with the default settings of > the affinity functions. Some calls may not return any data at all, which makes > them wasted network round-trips. Also, some databases may be unable to perform > that many parallel queries effectively without degrading total throughput. > > The DataStreamer approach may be faster, but it should be tested.
> > 2016-11-16 16:40 GMT+03:00 Dmitriy Setrakyan : > > > On Wed, Nov 16, 2016 at 1:54 PM, Yakov Zhdanov > > wrote: > > > > > > On Wed, Nov 16, 2016 at 11:22 AM, Yakov Zhdanov > > > > wrote: > > > > > > > > > Yakov, I agree that such scenario should be avoided. I also think > > > that > > > > > > > > > loadCache(...) method, as it is right now, provides a way to > avoid > > > it. > > > > > > > > > > > > > > > > No, it does not. > > > > > > > > > > > > Yes it does :) > > > > > > No it doesn't. Load cache should either send a query to DB that filters > > all > > > the data on server side which, in turn, may result to full-scan of 2 Tb > > > data set dozens of times (equal to node count) or send a query that > > brings > > > the whole dataset to each node which is unacceptable as well. > > > > > > > Why not store the partition ID in the database and query only local > > partitions? Whatever approach we design with a DataStreamer will be > slower > > than this. > > > > > > -- > Thanks, > Alexandr Kuramshin >
Re: IgniteCache.loadCache improvement proposal
Dmitriy,

I am not at all confident that partition ID is the best approach in all cases. Even if we have full access to the database structure, there are other problems.

Assume we have a table PERSON (ID NUMBER, NAME VARCHAR, SURNAME VARCHAR, AGE NUMBER, EMPL_DATE DATE), and we add our own column PART NUMBER. While we already have indexes IDX1(NAME), IDX2(SURNAME), IDX3(AGE), IDX4(EMPL_DATE), we have to add a new 2-column index IDX5(PART, EMPL_DATE) for pre-loading at startup, for example, recently employed persons. And if we'd like to query filtered data from the database, we'd also have to create the other compound indexes IDX6(PART, NAME), IDX7(PART, SURNAME), IDX8(PART, AGE). So we roughly double the index overhead.

After these modifications to the database are done and the PART column is filled, what should we do to preload the data? We have to perform as many database queries as there are partitions stored on the node. The number of queries would be 1024 with the default settings of the affinity functions. Some calls may not return any data at all, which makes them wasted network round-trips. Also, some databases may be unable to perform that many parallel queries effectively without degrading total throughput.

The DataStreamer approach may be faster, but it should be tested.

2016-11-16 16:40 GMT+03:00 Dmitriy Setrakyan: > On Wed, Nov 16, 2016 at 1:54 PM, Yakov Zhdanov > wrote: > > > > On Wed, Nov 16, 2016 at 11:22 AM, Yakov Zhdanov > > wrote: > > > > > > > Yakov, I agree that such scenario should be avoided. I also think > > that > > > > > > > loadCache(...) method, as it is right now, provides a way to avoid > > it. > > > > > > > > > > > > No, it does not. > > > > > > > > > Yes it does :) > > > > No it doesn't.
Load cache should either send a query to DB that filters > all > > the data on server side which, in turn, may result to full-scan of 2 Tb > > data set dozens of times (equal to node count) or send a query that > brings > > the whole dataset to each node which is unacceptable as well. > > > > Why not store the partition ID in the database and query only local > partitions? Whatever approach we design with a DataStreamer will be slower > than this. > -- Thanks, Alexandr Kuramshin
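[Editor's note] To make the trade-off concrete, the per-partition query pattern debated above (a PART column plus one query per locally-owned partition) might look like the sketch below. The PERSON table and PART column are the hypothetical ones from Alexandr's example; the SQL is illustrative, not anything Ignite generates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Sketch of the "partition column" approach: each node issues one SQL
// query per locally-owned partition. With the default of 1024 partitions,
// a node owning N partitions issues N round-trips, some possibly empty.
public class PartitionQueries {
    static List<String> queriesFor(Collection<Integer> localParts) {
        List<String> queries = new ArrayList<>();
        for (int part : localParts)
            queries.add("SELECT ID, NAME, SURNAME, AGE, EMPL_DATE "
                + "FROM PERSON WHERE PART = " + part);
        return queries;
    }

    public static void main(String[] args) {
        // A node owning three partitions issues three separate queries.
        for (String q : queriesFor(Arrays.asList(0, 1, 513)))
            System.out.println(q);
    }
}
```

This is exactly the overhead Alexandr points at: the query count scales with the partition count, not with the data size.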
Re: IgniteCache.loadCache improvement proposal
Alexandr, The 'local' prefix in Ignite APIs means that the method is invoked only on the current node, while its regular sibling is invoked in distributed fashion. localLoadCache doesn't imply that only local partitions are loaded. It happens to work this way right now, but that doesn't mean it can't be changed (and I don't suggest changing the default behavior, BTW). Method overhead is decreased with my approach, if used properly. You can call localLoadCache with the data-streamer-based closure, and the database will be queried only from the local node, which will then distribute the data across the other nodes. All I did was abstract out the logic of moving an entry from store to cache, because currently the user doesn't have an option to override it. If you still believe this doesn't work, can you please elaborate on what exactly you propose? What code should we add and/or change in Ignite, and how will the user use it, API-wise? -Val On Wed, Nov 16, 2016 at 5:40 AM, Dmitriy Setrakyan wrote: > On Wed, Nov 16, 2016 at 1:54 PM, Yakov Zhdanov > wrote: > > > > On Wed, Nov 16, 2016 at 11:22 AM, Yakov Zhdanov > > wrote: > > > > > > > Yakov, I agree that such scenario should be avoided. I also think > > that > > > > > > > loadCache(...) method, as it is right now, provides a way to avoid > > it. > > > > > > > > > > > > No, it does not. > > > > > > > > > Yes it does :) > > > > No it doesn't. Load cache should either send a query to DB that filters > all > > the data on server side which, in turn, may result to full-scan of 2 Tb > > data set dozens of times (equal to node count) or send a query that > brings > > the whole dataset to each node which is unacceptable as well. > > > > Why not store the partition ID in the database and query only local > > partitions? Whatever approach we design with a DataStreamer will be > slower > > than this. >
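[Editor's note] Val's data-streamer-based closure can be illustrated with a self-contained sketch. The Store interface and the List standing in for the streamer are stand-ins; in Ignite the closure body would call IgniteDataStreamer.addData(key, value), which routes each entry to its owning node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

// Sketch of the proposal: the store hands every loaded entry to a
// pluggable closure. One implementation would put entries straight into
// the local cache; the streamer-based one shown here forwards them to a
// streamer-like sink, so only this node queries the database.
public class StreamerClosure {
    interface Store {
        void loadCache(BiConsumer<String, String> clo);
    }

    // Fake store that "loads" two rows from persistence.
    static final Store STORE = clo -> {
        clo.accept("1", "Alice");
        clo.accept("2", "Bob");
    };

    /** Loads the store through a streamer-like sink; returns what was streamed. */
    static List<String> loadViaStreamer(Store store) {
        List<String> streamer = new ArrayList<>();
        // Streamer-based closure: entries go to the streamer instead of
        // being cached locally.
        store.loadCache((k, v) -> streamer.add(k + "=" + v));
        return streamer;
    }

    public static void main(String[] args) {
        System.out.println(loadViaStreamer(STORE)); // prints [1=Alice, 2=Bob]
    }
}
```

The key property is that the same Store implementation works unchanged whether the closure does a local put or a streamer addData - which is the reuse argument Val makes above.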
Re: IgniteCache.loadCache improvement proposal
On Wed, Nov 16, 2016 at 1:54 PM, Yakov Zhdanov wrote: > > On Wed, Nov 16, 2016 at 11:22 AM, Yakov Zhdanov > wrote: > > > > > Yakov, I agree that such scenario should be avoided. I also think > that > > > > > loadCache(...) method, as it is right now, provides a way to avoid > it. > > > > > > > > No, it does not. > > > > > > Yes it does :) > > No it doesn't. Load cache should either send a query to DB that filters all > the data on server side which, in turn, may result to full-scan of 2 Tb > data set dozens of times (equal to node count) or send a query that brings > the whole dataset to each node which is unacceptable as well. > Why not store the partition ID in the database and query only local partitions? Whatever approach we design with a DataStreamer will be slower than this.
Re: IgniteCache.loadCache improvement proposal
> On Wed, Nov 16, 2016 at 11:22 AM, Yakov Zhdanov wrote: > > > Yakov, I agree that such scenario should be avoided. I also think that > > > loadCache(...) method, as it is right now, provides a way to avoid it. > > > > No, it does not. > > > Yes it does :) No it doesn't. Load cache would either have to send a query to the DB that filters all the data on the server side, which, in turn, may result in a full scan of a 2 TB data set dozens of times (equal to the node count), or send a query that brings the whole dataset to each node, which is unacceptable as well. --Yakov
Re: IgniteCache.loadCache improvement proposal
On Wed, Nov 16, 2016 at 11:22 AM, Yakov Zhdanov wrote: > > Yakov, I agree that such scenario should be avoided. I also think that > > loadCache(...) method, as it is right now, provides a way to avoid it. > > No, it does not. > Yes it does :)
Re: IgniteCache.loadCache improvement proposal
> Yakov, I agree that such scenario should be avoided. I also think that > loadCache(...) method, as it is right now, provides a way to avoid it. No, it does not. --Yakov
Re: IgniteCache.loadCache improvement proposal
Hi all,

Denis, thank you for the explanation; your understanding of the question is the closest to mine.

Extending the method IgniteCache.loadCache by adding an IgniteClosure is a handy feature which may be useful in some cases, but it does not address the problem of extensive network utilization. Actually, I vote against that extension - uses of that method will have the same overhead on the network.

IgniteCache.localLoadCache, as its name suggests, should only load entities for the local cache partitions, and such filtering should be done before invoking the predicate, to avoid unnecessary processing of entities that will not be stored in the cache. So extending the method with an IgniteClosure does not resolve the problem, because the IgniteClosure would be called only after the IgnitePredicate has done its filtering.

The last argument is that extending the API does not fix existing usages of the non-optimized method IgniteCache.loadCache. My wish is to re-implement IgniteCache.loadCache. After the re-implementation has been done, we can extend the API by adding additional arguments like IgniteClosure to make cache store operations customizable.

2016-11-16 3:51 GMT+03:00 Denis Magda: > Val, > > Then I would create a blog post on how to use the new API proposed by you > to accomplish the scenario described by Alexandr. Are you willing to write > the post once the API is implemented? > > Alexandr, do you think the API proposed by Val will resolve your case when > it’s used as listed below? If so, are you interested to take over the > implementation and contribute to Apache Ignite? > > — > Denis > > > On Nov 15, 2016, at 2:30 PM, Valentin Kulichenko < > valentin.kuliche...@gmail.com> wrote: > > > > Denis, > > > > The loading will be most likely initiated by the application anyway, even > > if you call localLoadCache on one of the server nodes. I.e. the flow is > the > > following: > > > > 1.
Client sends a closure to a server node (e.g. oldest or random). > > 2. The closure calls localLoadCache method. > > 3. If this server node fails (or if the loading process fails), client > > gets an exception and retries if needed. > > > > I would not complicate the API and implementation even more. We have > > compute grid API that already allows to handle things you're describing. > > It's very flexible and easy to use. > > > > -Val > > > > On Tue, Nov 15, 2016 at 2:20 PM, Denis Magda wrote: > > > >> Well, that’s clear. However, with localLoadCache the user still has to > >> care about the fault-tolerance if the node that loads the data goes > down. > >> What if we provide an overloaded version of loadCache that will accept a > >> number of nodes where the closure has to be executed? If the number > >> decreases then the engine will re-execute the closure on a node that is > >> alive. > >> > >> — > >> Denis > >> > >> > >>> On Nov 15, 2016, at 2:06 PM, Valentin Kulichenko < > >> valentin.kuliche...@gmail.com> wrote: > >>> > >>> You can use localLoadCache method for this (it should be overloaded as > >> well > >>> of course). Basically, if you provide closure based on > IgniteDataStreamer > >>> and call localLoadCache on one of the nodes (client or server), it's > the > >>> same approach as described in [1], but with the possibility to reuse > >>> existing persistence code. Makes sense? > >>> > >>> [1] https://apacheignite.readme.io/docs/data-loading# > ignitedatastreamer > >>> > >>> -Val > >>> > >>> On Tue, Nov 15, 2016 at 1:15 PM, Denis Magda > wrote: > >>> > How would your proposal resolve the main point Aleksandr is trying to > convey that is extensive network utilization? > > As I see the loadCache method still will be triggered on every and as > before all the nodes will pre-load all the data set from a database. > >> That > was Aleksandr’s reasonable concern. 
> > If we make up a way how to call the loadCache on a specific node only > >> and > implement some falt-tolerant mechanism then your suggestion should > work > perfectly fine. > > — > Denis > > > On Nov 15, 2016, at 12:05 PM, Valentin Kulichenko < > valentin.kuliche...@gmail.com> wrote: > > > > It sounds like Aleksandr is basically proposing to support automatic > > persistence [1] for loading through data streamer and we really don't > have > > this. However, I think I have more generic solution in mind. > > > > What if we add one more IgniteCache.loadCache overload like this: > > > > loadCache(@Nullable IgniteBiPredicate p, IgniteBiInClosure V> > > clo, @Nullable > > Object... args) > > > > It's the same as the existing one, but with the key-value closure > provided > > as a parameter. This closure will be passed to the > CacheStore.loadCache > > along with the arguments and will allow
Re: IgniteCache.loadCache improvement proposal
Val, Then I would create a blog post on how to use the new API proposed by you to accomplish the scenario described by Alexandr. Are you willing to write the post once the API is implemented? Alexandr, do you think the API proposed by Val will resolve your case when it’s used as listed below? If so, are you interested to take over the implementation and contribute to Apache Ignite? — Denis > On Nov 15, 2016, at 2:30 PM, Valentin Kulichenko > wrote: > > Denis, > > The loading will be most likely initiated by the application anyway, even > if you call localLoadCache on one of the server nodes. I.e. the flow is the > following: > > 1. Client sends a closure to a server node (e.g. oldest or random). > 2. The closure calls localLoadCache method. > 3. If this server node fails (or if the loading process fails), client > gets an exception and retries if needed. > > I would not complicate the API and implementation even more. We have > compute grid API that already allows to handle things you're describing. > It's very flexible and easy to use. > > -Val > > On Tue, Nov 15, 2016 at 2:20 PM, Denis Magda wrote: > >> Well, that’s clear. However, with localLoadCache the user still has to >> care about the fault-tolerance if the node that loads the data goes down. >> What if we provide an overloaded version of loadCache that will accept a >> number of nodes where the closure has to be executed? If the number >> decreases then the engine will re-execute the closure on a node that is >> alive. >> >> — >> Denis >> >> >>> On Nov 15, 2016, at 2:06 PM, Valentin Kulichenko < >> valentin.kuliche...@gmail.com> wrote: >>> >>> You can use localLoadCache method for this (it should be overloaded as >> well >>> of course). Basically, if you provide closure based on IgniteDataStreamer >>> and call localLoadCache on one of the nodes (client or server), it's the >>> same approach as described in [1], but with the possibility to reuse >>> existing persistence code. Makes sense?
>>> >>> [1] https://apacheignite.readme.io/docs/data-loading#ignitedatastreamer >>> >>> -Val >>> >>> On Tue, Nov 15, 2016 at 1:15 PM, Denis Magda wrote: >>> How would your proposal resolve the main point Aleksandr is trying to convey that is extensive network utilization? As I see the loadCache method still will be triggered on every and as before all the nodes will pre-load all the data set from a database. >> That was Aleksandr’s reasonable concern. If we make up a way how to call the loadCache on a specific node only >> and implement some falt-tolerant mechanism then your suggestion should work perfectly fine. — Denis > On Nov 15, 2016, at 12:05 PM, Valentin Kulichenko < valentin.kuliche...@gmail.com> wrote: > > It sounds like Aleksandr is basically proposing to support automatic > persistence [1] for loading through data streamer and we really don't have > this. However, I think I have more generic solution in mind. > > What if we add one more IgniteCache.loadCache overload like this: > > loadCache(@Nullable IgniteBiPredicate p, IgniteBiInClosure > clo, @Nullable > Object... args) > > It's the same as the existing one, but with the key-value closure provided > as a parameter. This closure will be passed to the CacheStore.loadCache > along with the arguments and will allow to override the logic that actually > saves the loaded entry in cache (currently this logic is always >> provided by > the cache itself and user can't control it). > > We can then provide the implementation of this closure that will >> create a > data streamer and call addData() within its apply() method. > > I see the following advantages: > > - Any existing CacheStore implementation can be reused to load through > streamer (our JDBC and Cassandra stores or anything else that user has). > - Loading code is always part of CacheStore implementation, so it's very > easy to switch between different ways of loading. 
> - User is not limited by two approaches we provide out of the box, >> they > can always implement a new one. > > Thoughts? > > [1] https://apacheignite.readme.io/docs/automatic-persistence > > -Val > > On Tue, Nov 15, 2016 at 2:27 AM, Alexey Kuznetsov < >> akuznet...@apache.org > > wrote: > >> Hi, All! >> >> I think we do not need to chage API at all. >> >> public void loadCache(@Nullable IgniteBiPredicate p, @Nullable >> Object... args) throws CacheException; >> >> We could pass any args to loadCache(); >> >> So we could create class >> IgniteCacheLoadDescriptor { >> some fields that will describe how to load >> } >>
Re: IgniteCache.loadCache improvement proposal
Denis,

The loading will most likely be initiated by the application anyway, even if you call localLoadCache on one of the server nodes. I.e. the flow is the following:

1. The client sends a closure to a server node (e.g. oldest or random).
2. The closure calls the localLoadCache method.
3. If this server node fails (or if the loading process fails), the client gets an exception and retries if needed.

I would not complicate the API and implementation even more. We have the compute grid API that already allows handling the things you're describing. It's very flexible and easy to use.

-Val

On Tue, Nov 15, 2016 at 2:20 PM, Denis Magda wrote: > Well, that’s clear. However, with localLoadCache the user still has to > care about the fault-tolerance if the node that loads the data goes down. > What if we provide an overloaded version of loadCache that will accept a > number of nodes where the closure has to be executed? If the number > decreases then the engine will re-execute the closure on a node that is > alive. > > — > Denis > > > > On Nov 15, 2016, at 2:06 PM, Valentin Kulichenko < > valentin.kuliche...@gmail.com> wrote: > > > > You can use localLoadCache method for this (it should be overloaded as > well > > of course). Basically, if you provide closure based on IgniteDataStreamer > > and call localLoadCache on one of the nodes (client or server), it's the > > same approach as described in [1], but with the possibility to reuse > > existing persistence code. Makes sense? > > > > [1] https://apacheignite.readme.io/docs/data-loading#ignitedatastreamer > > > > -Val > > > > On Tue, Nov 15, 2016 at 1:15 PM, Denis Magda wrote: > > > >> How would your proposal resolve the main point Aleksandr is trying to > >> convey that is extensive network utilization? > >> > >> As I see the loadCache method still will be triggered on every node and as > >> before all the nodes will pre-load all the data set from a database. > That > >> was Aleksandr’s reasonable concern.
> >> > >> If we make up a way how to call the loadCache on a specific node only > and > >> implement some falt-tolerant mechanism then your suggestion should work > >> perfectly fine. > >> > >> — > >> Denis > >> > >>> On Nov 15, 2016, at 12:05 PM, Valentin Kulichenko < > >> valentin.kuliche...@gmail.com> wrote: > >>> > >>> It sounds like Aleksandr is basically proposing to support automatic > >>> persistence [1] for loading through data streamer and we really don't > >> have > >>> this. However, I think I have more generic solution in mind. > >>> > >>> What if we add one more IgniteCache.loadCache overload like this: > >>> > >>> loadCache(@Nullable IgniteBiPredicate p, IgniteBiInClosure > >>> clo, @Nullable > >>> Object... args) > >>> > >>> It's the same as the existing one, but with the key-value closure > >> provided > >>> as a parameter. This closure will be passed to the CacheStore.loadCache > >>> along with the arguments and will allow to override the logic that > >> actually > >>> saves the loaded entry in cache (currently this logic is always > provided > >> by > >>> the cache itself and user can't control it). > >>> > >>> We can then provide the implementation of this closure that will > create a > >>> data streamer and call addData() within its apply() method. > >>> > >>> I see the following advantages: > >>> > >>> - Any existing CacheStore implementation can be reused to load through > >>> streamer (our JDBC and Cassandra stores or anything else that user > >> has). > >>> - Loading code is always part of CacheStore implementation, so it's > >> very > >>> easy to switch between different ways of loading. > >>> - User is not limited by two approaches we provide out of the box, > they > >>> can always implement a new one. > >>> > >>> Thoughts? > >>> > >>> [1] https://apacheignite.readme.io/docs/automatic-persistence > >>> > >>> -Val > >>> > >>> On Tue, Nov 15, 2016 at 2:27 AM, Alexey Kuznetsov < > akuznet...@apache.org > >>> > >>> wrote: > >>> > Hi, All! 
> > I think we do not need to chage API at all. > > public void loadCache(@Nullable IgniteBiPredicate p, @Nullable > Object... args) throws CacheException; > > We could pass any args to loadCache(); > > So we could create class > IgniteCacheLoadDescriptor { > some fields that will describe how to load > } > > > and modify POJO store to detect and use such arguments. > > > All we need is to implement this and write good documentation and > >> examples. > > Thoughts? > > On Tue, Nov 15, 2016 at 5:22 PM, Alexandr Kuramshin < > >> ein.nsk...@gmail.com> > wrote: > > > Hi Vladimir, > > > > I don't offer any changes in API. Usage scenario is the same as it > was > > described in > > https://apacheignite.readme.io/docs/persistent-store# > >> section-loadcache- > > > > The preload cache logic invokes
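[Editor's note] Alexey Kuznetsov's idea quoted above - keep the existing loadCache(IgniteBiPredicate, Object... args) signature and pass a descriptor through args - can be sketched as follows. IgniteCacheLoadDescriptor and its fields are the hypothetical names from the thread, not shipped API.

```java
// Sketch of the descriptor approach: the store inspects the existing
// varargs for a descriptor object and adjusts how it loads. The
// descriptor class and both of its fields are hypothetical.
public class DescriptorLoad {
    static class IgniteCacheLoadDescriptor {
        final boolean useStreamer; // load via data streamer vs local puts
        final int batchSize;       // rows fetched per round-trip

        IgniteCacheLoadDescriptor(boolean useStreamer, int batchSize) {
            this.useStreamer = useStreamer;
            this.batchSize = batchSize;
        }
    }

    /** Store-side detection of the descriptor among ordinary args. */
    static String describeLoad(Object... args) {
        for (Object arg : args)
            if (arg instanceof IgniteCacheLoadDescriptor) {
                IgniteCacheLoadDescriptor d = (IgniteCacheLoadDescriptor) arg;
                return (d.useStreamer ? "streamer" : "local") + "/" + d.batchSize;
            }
        return "default"; // no descriptor: keep today's behavior
    }

    public static void main(String[] args) {
        System.out.println(describeLoad("WHERE AGE > 30",
            new IgniteCacheLoadDescriptor(true, 512))); // prints streamer/512
        System.out.println(describeLoad("WHERE AGE > 30")); // prints default
    }
}
```

The advantage Alexey points at is binary compatibility: no new overloads, just a convention between the caller and the (modified) POJO store.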
Re: IgniteCache.loadCache improvement proposal
You can use the localLoadCache method for this (it should be overloaded as well of course). Basically, if you provide a closure based on IgniteDataStreamer and call localLoadCache on one of the nodes (client or server), it's the same approach as described in [1], but with the possibility to reuse existing persistence code. Makes sense? [1] https://apacheignite.readme.io/docs/data-loading#ignitedatastreamer -Val On Tue, Nov 15, 2016 at 1:15 PM, Denis Magda wrote: > How would your proposal resolve the main point Aleksandr is trying to > convey that is extensive network utilization? > > As I see the loadCache method still will be triggered on every node and as > before all the nodes will pre-load all the data set from a database. That > was Aleksandr’s reasonable concern. > > If we make up a way how to call the loadCache on a specific node only and > implement some fault-tolerant mechanism then your suggestion should work > perfectly fine. > > — > Denis > > > On Nov 15, 2016, at 12:05 PM, Valentin Kulichenko < > valentin.kuliche...@gmail.com> wrote: > > > > It sounds like Aleksandr is basically proposing to support automatic > > persistence [1] for loading through data streamer and we really don't > have > > this. However, I think I have more generic solution in mind. > > > > What if we add one more IgniteCache.loadCache overload like this: > > > > loadCache(@Nullable IgniteBiPredicate<K, V> p, IgniteBiInClosure<K, V> > > clo, @Nullable > > Object... args) > > > > It's the same as the existing one, but with the key-value closure > provided > > as a parameter. This closure will be passed to the CacheStore.loadCache > > along with the arguments and will allow to override the logic that > actually > > saves the loaded entry in cache (currently this logic is always > provided > > by > > the cache itself and user can't control it). > > > > We can then provide the implementation of this closure that will > create a > > data streamer and call addData() within its apply() method.
> > > > I see the following advantages: > > > > - Any existing CacheStore implementation can be reused to load through > > streamer (our JDBC and Cassandra stores or anything else that user > has). > > - Loading code is always part of CacheStore implementation, so it's > very > > easy to switch between different ways of loading. > > - User is not limited by two approaches we provide out of the box, they > > can always implement a new one. > > > > Thoughts? > > > > [1] https://apacheignite.readme.io/docs/automatic-persistence > > > > -Val > > > > On Tue, Nov 15, 2016 at 2:27 AM, Alexey Kuznetsov > > > wrote: > > > >> Hi, All! > >> > >> I think we do not need to chage API at all. > >> > >> public void loadCache(@Nullable IgniteBiPredicate p, @Nullable > >> Object... args) throws CacheException; > >> > >> We could pass any args to loadCache(); > >> > >> So we could create class > >> IgniteCacheLoadDescriptor { > >> some fields that will describe how to load > >> } > >> > >> > >> and modify POJO store to detect and use such arguments. > >> > >> > >> All we need is to implement this and write good documentation and > examples. > >> > >> Thoughts? > >> > >> On Tue, Nov 15, 2016 at 5:22 PM, Alexandr Kuramshin < > ein.nsk...@gmail.com> > >> wrote: > >> > >>> Hi Vladimir, > >>> > >>> I don't offer any changes in API. Usage scenario is the same as it was > >>> described in > >>> https://apacheignite.readme.io/docs/persistent-store# > section-loadcache- > >>> > >>> The preload cache logic invokes IgniteCache.loadCache() with some > >>> additional arguments, depending on a CacheStore implementation, and > then > >>> the loading occurs in the way I've already described. > >>> > >>> > >>> 2016-11-15 11:26 GMT+03:00 Vladimir Ozerov : > >>> > Hi Alex, > > >>> Let's give the user the reusable code which is convenient, reliable > >>> and > fast. > Convenience - this is why I asked for example on how API can look like > >>> and > how users are going to use it. > > Vladimir. 
> > On Tue, Nov 15, 2016 at 11:18 AM, Alexandr Kuramshin < > >>> ein.nsk...@gmail.com > > > wrote: > > > Hi all, > > > > I think the discussion goes a wrong direction. Certainly it's not a > >> big > > deal to implement some custom user logic to load the data into > >> caches. > But > > Ignite framework gives the user some reusable code build on top of > >> the > > basic system. > > > > So the main question is: Why developers let the user to use > >> convenient > way > > to load caches with totally non-optimal solution? > > > > We could talk too much about different persistence storage types, but > > whenever we initiate the loading with IgniteCache.loadCache the > >> current > > implementation imposes much overhead on the network. > > > > Partition-aware data loading may be used in some
Re: IgniteCache.loadCache improvement proposal
How would your proposal resolve the main point Aleksandr is trying to convey, that is, extensive network utilization? As I see it, the loadCache method will still be triggered on every node, and as before all the nodes will pre-load the whole data set from the database. That was Aleksandr’s reasonable concern. If we find a way to call loadCache on a specific node only and implement some fault-tolerant mechanism, then your suggestion should work perfectly fine. — Denis > On Nov 15, 2016, at 12:05 PM, Valentin Kulichenko > wrote: > > It sounds like Aleksandr is basically proposing to support automatic > persistence [1] for loading through data streamer and we really don't have > this. However, I think I have a more generic solution in mind. > > What if we add one more IgniteCache.loadCache overload like this: > > loadCache(@Nullable IgniteBiPredicate<K, V> p, IgniteBiInClosure<K, V> > clo, @Nullable > Object... args) > > It's the same as the existing one, but with the key-value closure provided > as a parameter. This closure will be passed to the CacheStore.loadCache > along with the arguments and will allow to override the logic that actually > saves the loaded entry in cache (currently this logic is always >> provided by > the cache itself and user can't control it). > > We can then provide the implementation of this closure that will >> create a > data streamer and call addData() within its apply() method. > > I see the following advantages: > > - Any existing CacheStore implementation can be reused to load through > streamer (our JDBC and Cassandra stores or anything else that user has). > - Loading code is always part of CacheStore implementation, so it's very > easy to switch between different ways of loading. > - User is not limited by two approaches we provide out of the box, they > can always implement a new one. > > Thoughts? > > [1] https://apacheignite.readme.io/docs/automatic-persistence > > -Val > > On Tue, Nov 15, 2016 at 2:27 AM, Alexey Kuznetsov > wrote: > >> Hi, All!
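Denis's suggestion of running loadCache on one designated node with a fault-tolerant fallback could be modeled in plain Java as below. Nodes are represented by string ids ordered by age; in Ignite the initiator would be picked via `ignite.cluster().forServers().forOldest()` (or a LoadBalancingSpi) and the load submitted with `ignite.compute(group).run(...)`. All names here are illustrative stand-ins, not Ignite API.

```java
import java.util.*;

// Toy model: run CacheStore.loadCache on one node only; if that node fails,
// fall over to the next live node instead of re-loading on every node.
public class SingleNodeLoadSketch {
    // Pick the node that should run loadCache, skipping nodes known to have failed.
    static String chooseLoader(List<String> nodesByAge, Set<String> failed) {
        for (String node : nodesByAge)
            if (!failed.contains(node))
                return node;
        throw new IllegalStateException("no live nodes to run loadCache on");
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node-1", "node-2", "node-3");
        // node-1 died: the load fails over to node-2 rather than running everywhere.
        System.out.println(chooseLoader(nodes, Set.of("node-1"))); // node-2
    }
}
```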
Re: IgniteCache.loadCache improvement proposal
It sounds like Aleksandr is basically proposing to support automatic persistence [1] for loading through the data streamer, and we really don't have this. However, I think I have a more generic solution in mind. What if we add one more IgniteCache.loadCache overload like this:

loadCache(@Nullable IgniteBiPredicate<K, V> p, IgniteBiInClosure<K, V> clo, @Nullable Object... args)

It's the same as the existing one, but with a key-value closure provided as a parameter. This closure will be passed to CacheStore.loadCache along with the arguments and will allow overriding the logic that actually saves the loaded entry in the cache (currently this logic is always provided by the cache itself and the user can't control it).

We can then provide an implementation of this closure that will create a data streamer and call addData() within its apply() method.

I see the following advantages:

- Any existing CacheStore implementation can be reused to load through the streamer (our JDBC and Cassandra stores, or anything else the user has).
- Loading code is always part of the CacheStore implementation, so it's very easy to switch between different ways of loading.
- The user is not limited to the two approaches we provide out of the box; they can always implement a new one.

Thoughts?

[1] https://apacheignite.readme.io/docs/automatic-persistence

-Val

On Tue, Nov 15, 2016 at 2:27 AM, Alexey Kuznetsov wrote:
> [...]
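Val's proposed overload (passing the key-value closure down to CacheStore.loadCache so a data streamer can be plugged in) can be modeled in plain Java. Here `BiConsumer` stands in for Ignite's `IgniteBiInClosure`, and a map stands in for `IgniteDataStreamer.addData()`; all names are illustrative, not the actual Ignite API.

```java
import java.util.*;
import java.util.function.*;

// Model of the proposal: the store hands each loaded entry to the supplied
// closure instead of writing into the cache itself, so the caller decides
// whether entries go through a cache put or a streamer.
public class ClosureLoadSketch {
    // Simulated CacheStore.loadCache(clo, args): reads "rows" (the database)
    // and passes every entry to the closure.
    static void storeLoadCache(BiConsumer<Integer, String> clo, Map<Integer, String> rows) {
        rows.forEach(clo);
    }

    public static void main(String[] args) {
        Map<Integer, String> db = Map.of(1, "a", 2, "b");
        Map<Integer, String> streamed = new HashMap<>();
        // Streamer-backed closure: in Ignite, apply() would call streamer.addData(k, v).
        BiConsumer<Integer, String> streamerClo = streamed::put;
        storeLoadCache(streamerClo, db);
        System.out.println(streamed.size()); // 2
    }
}
```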
Re: IgniteCache.loadCache improvement proposal
Hi, All!

I think we do not need to change the API at all:

public void loadCache(@Nullable IgniteBiPredicate<K, V> p, @Nullable Object... args) throws CacheException;

We could pass any args to loadCache(). So we could create a class:

IgniteCacheLoadDescriptor {
    some fields that describe how to load
}

and modify the POJO store to detect and use such arguments.

All we need is to implement this and write good documentation and examples.

Thoughts?

On Tue, Nov 15, 2016 at 5:22 PM, Alexandr Kuramshin wrote:
> [...]
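Alexey's descriptor idea — no API change, just a marker object passed through the existing `loadCache(p, Object... args)` varargs that a POJO store would detect — can be sketched as plain Java. The class name and fields are hypothetical, taken from his message, not an existing Ignite type.

```java
import java.util.*;

// Sketch: the store receives the untyped args array and picks out any
// descriptor objects describing what (and how) to load.
public class LoadDescriptorSketch {
    static class IgniteCacheLoadDescriptor {
        final String table; // table to load from
        final String where; // optional SQL filter describing the subset to load
        IgniteCacheLoadDescriptor(String table, String where) {
            this.table = table;
            this.where = where;
        }
    }

    // How a store's loadCache(clo, args) might detect descriptors among args.
    static List<IgniteCacheLoadDescriptor> descriptors(Object... args) {
        List<IgniteCacheLoadDescriptor> res = new ArrayList<>();
        for (Object a : args)
            if (a instanceof IgniteCacheLoadDescriptor)
                res.add((IgniteCacheLoadDescriptor) a);
        return res;
    }

    public static void main(String[] args) {
        Object[] cacheArgs = {
            "unrelated-arg",
            new IgniteCacheLoadDescriptor("Person", "id < 1000000")
        };
        System.out.println(descriptors(cacheArgs).size()); // 1
    }
}
```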
Re: IgniteCache.loadCache improvement proposal
Hi Vladimir,

I don't propose any changes to the API. The usage scenario is the same as described in https://apacheignite.readme.io/docs/persistent-store#section-loadcache-

The preload logic invokes IgniteCache.loadCache() with some additional arguments, depending on the CacheStore implementation, and then the loading occurs in the way I've already described.

2016-11-15 11:26 GMT+03:00 Vladimir Ozerov:
> [...]
Re: IgniteCache.loadCache improvement proposal
Hi Alex,

>>> Let's give the user the reusable code which is convenient, reliable and fast.

Convenience - this is why I asked for an example of how the API could look and how users would use it.

Vladimir.

On Tue, Nov 15, 2016 at 11:18 AM, Alexandr Kuramshin wrote:
> [...]
Re: IgniteCache.loadCache improvement proposal
On Tue, Nov 15, 2016 at 9:07 AM, Yakov Zhdanov wrote:

> As far as I can understand, Alex was trying to avoid the scenario when the user needs to bring a 1 TB dataset to each node of a 50-node cluster and then discard 49/50 of the loaded data. For me this seems to be a very good catch.

Yakov, I agree that such a scenario should be avoided. I also think that the loadCache(...) method, as it is right now, provides a way to avoid it. DataStreamer also seems like an option here, but in that case the loadCache(...) method should not be used at all, to my understanding.
Re: IgniteCache.loadCache improvement proposal
Hi all,

I think the discussion is going in the wrong direction. Certainly it's not a big deal to implement some custom user logic to load the data into caches. But the Ignite framework gives the user reusable code built on top of the basic system.

So the main question is: why does the framework give the user a convenient way to load caches that relies on a totally non-optimal implementation?

We could talk at length about different persistence storage types, but whenever we initiate the loading with IgniteCache.loadCache, the current implementation imposes a lot of overhead on the network.

Partition-aware data loading may be used in some scenarios to avoid this network overhead, but users are compelled to take additional steps to achieve this optimization: adding a column to tables, adding compound indices that include the added column, writing a piece of repetitive code to load the data into different caches in a fault-tolerant fashion, etc.

Let's give the user reusable code which is convenient, reliable and fast.

2016-11-14 20:56 GMT+03:00 Valentin Kulichenko <valentin.kuliche...@gmail.com>:
> [...]
> > > > *"Because of none keys are passed to the CacheStore.loadCache methods, > the > > > underlying implementation is forced to read all the data from the > > > persistence storage"* > > > > > > According to the javadoc, loadCache(...) method receives an optional > > argument from the user. You can pass anything you like, including a list > of > > keys, or an SQL where clause, etc. > > > > *"The partition-aware data loading approach is not a choice. It requires > > > persistence of the volatile data depended on affinity function > > > implementation and settings."* > > > > > > This is only partially true. While Ignite allows to plugin custom > affinity > > functions, the affinity function is not something that changes > dynamically > > and should always return the same partition for the same key.So, the > > partition assignments are not volatile at all. If, in some very rare > case, > > the partition assignment logic needs to change, then you could update the > > partition assignments that you may have persisted elsewhere as well, e.g. > > database. > > > > D. > > > > On Mon, Nov 14, 2016 at 10:23 AM, Vladimir Ozerov > > wrote: > > > > > Alexandr, Alexey, > > > > > > While I agree with you that current cache loading logic is far from > > ideal, > > > it would be cool to see API drafts based on your suggestions to get > > better > > > understanding of your ideas. How exactly users are going to use your > > > suggestions? > > > > > > My main concern is that initial load is not very trivial task in > general > > > case. Some users have centralized RDBMS systems, some have NoSQL, > others > > > work with distributed persistent stores (e.g. HDFS). Sometimes we have > > > Ignite nodes "near" persistent data, sometimes we don't. Sharding, > > > affinity, co-location, etc.. If we try to support all (or many) cases > out > > > of the box, we may end up in very messy and difficult API. 
So we should > > > carefully balance between simplicity, usability and feature-rich > > > characteristics here. > > > > > > Personally, I think that if user is not satisfied with "loadCache()" > API, > > > he just writes simple closure with blackjack streamer and queries and > > send > > > it to whatever node he finds convenient. Not a big deal. Only very > common > > > cases should be added to Ignite API. > > > > > > Vladimir. > > > > > > > > > On Mon, Nov 14, 2016 at 12:43 PM, Alexey Kuznetsov < > > > akuznet...@gridgain.com> > > > wrote: > > > > > > > Looks good for me. > > > > > > > > But I will suggest to consider one more use-case: > > > > > > > > If user knows its data he could manually split loading. > > > > For example: table Persons contains 10M rows. > > > > User could provide something like: > > > > cache.loadCache(null, "Person", "select * from Person where id < > > > > 1_000_000", > > > > "Person", "select * from Person where id >= 1_000_000 and id < > > > 2_000_000", > > > > > > > > "Person", "select * from Person where id >= 9_000_000 and id < > > > 10_000_000", > > > > ); > > > > > > > > or may be it could be some descriptor object like
Re: IgniteCache.loadCache improvement proposal
As far as I can understand, Alex was trying to avoid the scenario when the user needs to bring a 1 TB dataset to each node of a 50-node cluster and then discard 49/50 of the loaded data. For me this seems to be a very good catch. However, I agree with Val that this may be implemented apart from the store, and the user can continue using the store for read/write-through, so there is probably no need to alter any API. Maybe we need to outline Val's suggestion in the documentation and describe it as one of the possible scenarios. Thoughts?

--Yakov
Re: IgniteCache.loadCache improvement proposal
Hi Aleksandr,

Data streamer is already outlined as one of the possible approaches for loading the data [1]. Basically, you start a designated client node or choose a leader among the server nodes [2] and then use the IgniteDataStreamer API to load the data. With this approach there is no need to have a CacheStore implementation at all. Can you please elaborate on what additional value you are trying to add here?

[1] https://apacheignite.readme.io/docs/data-loading#ignitedatastreamer
[2] https://apacheignite.readme.io/docs/leader-election

-Val

On Mon, Nov 14, 2016 at 8:23 AM, Dmitriy Setrakyan wrote:
> [...]
Re: IgniteCache.loadCache improvement proposal
Hi,

I just want to clarify a couple of API details from the original email to make sure that we are making the right assumptions here.

*"Because of none keys are passed to the CacheStore.loadCache methods, the underlying implementation is forced to read all the data from the persistence storage"*

According to the javadoc, the loadCache(...) method receives an optional argument from the user. You can pass anything you like, including a list of keys, an SQL where clause, etc.

*"The partition-aware data loading approach is not a choice. It requires persistence of the volatile data depended on affinity function implementation and settings."*

This is only partially true. While Ignite allows plugging in custom affinity functions, the affinity function is not something that changes dynamically, and it should always return the same partition for the same key. So the partition assignments are not volatile at all. If, in some very rare case, the partition assignment logic needs to change, then you could also update the partition assignments that you may have persisted elsewhere, e.g. in a database.

D.

On Mon, Nov 14, 2016 at 10:23 AM, Vladimir Ozerov wrote:
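Dmitriy's point about the optional argument can be illustrated with a small sketch of the store side. `queryFor` is a hypothetical helper (the CacheStore interface itself is elided here); it shows one way a loadCache implementation could interpret a user-supplied WHERE clause instead of always scanning the whole table.

```java
// Sketch of interpreting the optional varargs that loadCache(...)
// forwards to the store. The Person table comes from Alexey's example
// elsewhere in the thread; nothing here is an actual Ignite API.
public class LoadArgsSketch {

    // Treat the first argument, if present, as an SQL WHERE clause.
    public static String queryFor(Object... args) {
        String base = "SELECT id, name FROM Person";
        if (args != null && args.length > 0 && args[0] instanceof String)
            return base + " WHERE " + args[0];
        return base; // no filter: the full scan the thread complains about
    }

    public static void main(String[] args) {
        System.out.println(queryFor("id < 1000000")); // filtered load
        System.out.println(queryFor());               // full scan
    }
}
```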
Re: IgniteCache.loadCache improvement proposal
Alexandr, Alexey,

While I agree with you that the current cache loading logic is far from ideal, it would be cool to see API drafts based on your suggestions to get a better understanding of your ideas. How exactly are users going to use your suggestions?

My main concern is that the initial load is not a trivial task in the general case. Some users have centralized RDBMS systems, some have NoSQL, others work with distributed persistent stores (e.g. HDFS). Sometimes we have Ignite nodes "near" the persistent data, sometimes we don't. Sharding, affinity, co-location, etc. If we try to support all (or many) cases out of the box, we may end up with a very messy and difficult API. So we should carefully balance between simplicity, usability and feature-rich characteristics here.

Personally, I think that if a user is not satisfied with the "loadCache()" API, he just writes a simple closure with a streamer and queries and sends it to whatever node he finds convenient. Not a big deal. Only very common cases should be added to the Ignite API.

Vladimir.

On Mon, Nov 14, 2016 at 12:43 PM, Alexey Kuznetsov wrote:
Re: IgniteCache.loadCache improvement proposal
Looks good to me.

But I would suggest considering one more use case:

If the user knows his data, he could split the loading manually. For example: the Persons table contains 10M rows. The user could provide something like:

cache.loadCache(null,
    "Person", "select * from Person where id < 1_000_000",
    "Person", "select * from Person where id >= 1_000_000 and id < 2_000_000",
    ...
    "Person", "select * from Person where id >= 9_000_000 and id < 10_000_000"
);

or maybe it could be some descriptor object like:

{
    sql: "select * from Person where id >= ? and id < ?",
    range: 0...10_000_000
}

In this case the provided queries will be sent to as many nodes as there are queries, and the data will be loaded in parallel. For keys that are not local, a data streamer should be used (as described in Alexandr's description).

I think it is a good issue for Ignite 2.0.

Vova, Val - what do you think?

On Mon, Nov 14, 2016 at 4:01 PM, Alexandr Kuramshin wrote:
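Alexey's descriptor idea could be expanded into per-slice queries roughly as follows. `expand` is a hypothetical helper, not an Ignite API — it just substitutes each range's bounds into the parameterized SQL template from the descriptor.

```java
import java.util.ArrayList;
import java.util.List;

// Expand a {sql, range} descriptor into one query per slice, which
// could then be distributed across nodes for parallel loading.
public class RangeSplitSketch {

    public static List<String> expand(String sqlTemplate, long from, long to, long step) {
        List<String> queries = new ArrayList<>();
        for (long lo = from; lo < to; lo += step) {
            long hi = Math.min(lo + step, to);
            // Substitute the two '?' placeholders with this slice's bounds.
            queries.add(sqlTemplate.replaceFirst("\\?", Long.toString(lo))
                                   .replaceFirst("\\?", Long.toString(hi)));
        }
        return queries;
    }

    public static void main(String[] args) {
        List<String> qs = expand("select * from Person where id >= ? and id < ?",
            0L, 10_000_000L, 1_000_000L);
        System.out.println(qs.size()); // prints 10
        System.out.println(qs.get(0)); // first slice: id >= 0 and id < 1000000
    }
}
```

In a real implementation the substitution would of course be done with bind parameters rather than string splicing; the sketch only shows the slicing.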
Re: IgniteCache.loadCache improvement proposal
All right,

Let's assume a simple scenario. When IgniteCache.loadCache is invoked, we check whether the cache is not local, and if so, we initiate the new loading logic.

First, we take a "streamer" node. It could be chosen by utilizing LoadBalancingSpi, or it may be configured statically, for the reason that the streamer node is running on the same host as the persistence storage provider.

After that we start the loading task on the streamer node, which creates an IgniteDataStreamer and loads the cache with CacheStore.loadCache. Every call to IgniteBiInClosure.apply simply invokes IgniteDataStreamer.addData.

This implementation will completely relieve the overhead on the persistence storage provider. The network overhead is also decreased in the case of partitioned caches. For two nodes we get 1-1/2 times the amount of data transferred over the network (1 part will be transferred from the persistence storage to the streamer, and then 1/2 from the streamer node to the other node). For three nodes it will be 1-2/3, and so on, up to two times the amount of data on big clusters.

I'd like to propose some additional optimization at this point. If we have the streamer node on the same machine as the persistence storage provider, then we completely relieve the network overhead as well. It could be some special daemon node assigned for cache loading in the cache configuration, or an ordinary server node as well.

Certainly, these calculations assume an evenly partitioned cache with only primary nodes (without backups). In the case of one backup (the most frequent case, I think), we get 2 times the amount of data transferred over the network on two nodes, 2-1/3 on three, 2-1/2 on four, and so on, up to three times the amount of data on big clusters. Hence it's still better than the current implementation. In the worst case, with a fully replicated cache, we get N+1 times the amount of data transferred over the network (where N is the number of nodes in the cluster). But it's not a problem in small clusters, and only a little overhead in big clusters. And we still gain the persistence storage provider optimization.

Now let's take a more complex scenario. To achieve some level of parallelism, we could split our cluster into several groups. It could be a parameter of the IgniteCache.loadCache method or a cache configuration option. The number of groups could be a fixed value, or it could be calculated dynamically from the maximum number of nodes in a group.

After splitting the whole cluster into groups, we take a streamer node in each group and submit a loading task similar to the single-streamer scenario, except that only the keys corresponding to the cluster group where the streamer node is running will be passed to the IgniteDataStreamer.addData method.

In this case the overhead is proportional to the level of parallelism, not to the number of nodes in the whole cluster.

2016-11-11 15:37 GMT+03:00 Alexey Kuznetsov:

--
Thanks,
Alexandr Kuramshin
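The traffic figures above follow a simple pattern: 1 unit from the store to the streamer, plus (backups + 1) owner copies, of which the streamer node already holds about 1/N locally. A small sanity check of that arithmetic — the formula below is my own fit to the figures in the mail, not anything from the Ignite codebase:

```java
public class TransferRatioSketch {

    // Units of data crossing the network per unit of data loaded,
    // with the streamer-based approach on n nodes.
    public static double streamerRatio(int n, int backups) {
        return 1.0 + (backups + 1) * (n - 1) / (double) n;
    }

    // Current IgniteCache.loadCache: each of the n nodes scans the
    // whole store, so the store ships n full copies of the data.
    public static double currentRatio(int n) {
        return n;
    }

    public static void main(String[] args) {
        System.out.println(streamerRatio(2, 0)); // 1.5   -- "1-1/2" in the mail
        System.out.println(streamerRatio(3, 0)); // ~1.67 -- "1-2/3"
        System.out.println(streamerRatio(2, 1)); // 2.0   -- one backup, two nodes
        System.out.println(streamerRatio(3, 1)); // ~2.33 -- "2-1/3"
    }
}
```

As n grows, streamerRatio approaches 2 without backups and 3 with one backup, matching the "up to two times" and "up to three times" limits quoted above, while currentRatio grows without bound.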
Re: IgniteCache.loadCache improvement proposal
Alexandr,

Could you describe your proposal in more detail? Especially the case with several nodes.

On Fri, Nov 11, 2016 at 6:34 PM, Alexandr Kuramshin wrote:

--
Alexey Kuznetsov
IgniteCache.loadCache improvement proposal
Hi,

You know the CacheStore API that is commonly used for the read/write-through relationship of the in-memory data with the persistence storage.

There is also the IgniteCache.loadCache method for hot-loading the cache on startup. Invocation of this method causes execution of CacheStore.loadCache on all the nodes storing the cache partitions. Because no keys are passed to the CacheStore.loadCache methods, the underlying implementation is forced to read all the data from the persistence storage, even though only part of the data will be stored on each node.

So, the current implementation has two general drawbacks:

1. The persistence storage is forced to perform as many identical queries as there are nodes in the cluster. Each query may involve much additional computation on the persistence storage server.

2. The network is forced to transfer much more data than necessary, which is obviously a big disadvantage on large systems.

The partition-aware data loading approach, described in https://apacheignite.readme.io/docs/data-loading#section-partition-aware-data-loading , is not an option. It requires persisting volatile data that depends on the affinity function implementation and settings.

I propose using something like IgniteDataStreamer inside the IgniteCache.loadCache implementation.

--
Thanks,
Alexandr Kuramshin
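For reference, the partition-aware approach being rejected here looks roughly like this in plain Java. `partition()` is a stand-in, not Ignite's actual affinity function — and that is exactly the fragility the mail objects to: partition IDs persisted in the store are only valid as long as this function and its settings never change.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Node-local loading sketch: each node scans the store but keeps only
// rows whose partition it owns (in practice the partition ID would be
// persisted as a column and used in the WHERE clause instead).
public class PartitionAwareSketch {

    // Stand-in affinity: key -> partition. Changing this mapping
    // invalidates every persisted partition ID.
    public static int partition(int key, int partitions) {
        return Math.floorMod(Integer.hashCode(key), partitions);
    }

    public static Map<Integer, String> localLoad(Map<Integer, String> store,
                                                 Set<Integer> localParts,
                                                 int partitions) {
        Map<Integer, String> local = new HashMap<>();
        for (Map.Entry<Integer, String> e : store.entrySet())
            if (localParts.contains(partition(e.getKey(), partitions)))
                local.put(e.getKey(), e.getValue());
        return local;
    }
}
```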