[jira] [Comment Edited] (SOLR-13439) Make collection properties easier and safer to use in code
[ https://issues.apache.org/jira/browse/SOLR-13439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861095#comment-16861095 ]

Gus Heck edited comment on SOLR-13439 at 6/11/19 2:29 PM:
--

Given that this has been well discussed and the existence of the API has basically been agreed upon, I'm moving this forward so I can build on it, and if we need to tweak something about the details of the implementation we can still do that. Will be pushing to 8x shortly.

> Make collection properties easier and safer to use in code
> ----------------------------------------------------------
>
>            Key: SOLR-13439
>            URL: https://issues.apache.org/jira/browse/SOLR-13439
>        Project: Solr
>     Issue Type: Improvement
>     Components: SolrCloud
> Affects Versions: master (9.0)
>       Reporter: Gus Heck
>       Assignee: Gus Heck
>       Priority: Major
>    Attachments: SOLR-13439.patch, SOLR-13439.patch, SOLR-13439.patch
>
>
> (breaking this out from SOLR-13420, please read there for further background)
>
> Before this patch the API is quite confusing (IMHO):
> # Any code that wanted to know what the properties for a collection are could call zkStateReader.getCollectionProperties(collection), but this was a dangerous and trappy API because it issued a query to ZooKeeper every time. If a naive user auto-completed it in their IDE without investigating, heavy use of ZooKeeper would ensue.
> # To "do it right", any code that might get called on a per-doc or per-request basis had to cause caching by registering a watcher. At that point getCollectionProperties(collection) magically becomes safe to use, but the watcher pattern probably looks familiar and induces a user who hasn't read the Solr code closely to create their own cache and update it when their watcher is notified. If the caching side effect of watches isn't understood, this will lead to many in-memory copies of collection properties maintained in user code.
> # This also creates a task to be scheduled on a thread (PropsNotification) and induces an extra thread-scheduling lag before the changes can be observed by user code.
> # The code that cares about collection properties needs to have a lifecycle tied to either a collection or some other object with an even more ephemeral lifecycle, such as a URP. The user now also has to remember to ensure the watch is unregistered, or there is a leak.
>
> After this patch:
> # Calls to getCollectionProperties(collection) are always safe to use in any code anywhere. Caching and cleanup are automatic.
> # Code that really actually wants to know if a collection property changes so it can wake up and do something (autoscaling?) still has the option of registering a watcher that will asynchronously send it a notification.
> # Updates can be observed sooner via getCollectionProperties, with no need to wait for a thread to run (vs. a cache held in user code).

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
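The "after this patch" behavior can be sketched with a toy stand-in (this is not Solr's actual ZkStateReader; the class name, the fetch counter, and the stand-in data are invented for illustration, and the real implementation also handles expiry and ZK watches):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy stand-in for the cached getCollectionProperties behavior; not Solr code. */
public class PropsCacheSketch {
  private final Map<String, Map<String, String>> cache = new ConcurrentHashMap<>();
  final AtomicInteger zkReads = new AtomicInteger();

  /** Simulates one read from ZooKeeper. */
  private Map<String, String> readFromZk(String collection) {
    zkReads.incrementAndGet();
    return Map.of("someProperty", "someValue"); // stand-in data
  }

  /** Safe to call anywhere: only a cache miss touches "ZooKeeper". */
  public Map<String, String> getCollectionProperties(String collection) {
    return cache.computeIfAbsent(collection, this::readFromZk);
  }

  public static void main(String[] args) {
    PropsCacheSketch reader = new PropsCacheSketch();
    for (int i = 0; i < 1000; i++) {
      reader.getCollectionProperties("techproducts"); // e.g. per-document access
    }
    System.out.println("zk reads: " + reader.zkReads.get()); // prints: zk reads: 1
  }
}
```

The point of the patch, as described, is that naive per-doc or per-request callers get this cache-hit behavior automatically, without registering a watcher first.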
[jira] [Comment Edited] (SOLR-13439) Make collection properties easier and safer to use in code
[ https://issues.apache.org/jira/browse/SOLR-13439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836424#comment-16836424 ]

Gus Heck edited comment on SOLR-13439 at 5/9/19 2:39 PM:
-

{quote}Case 1 (Frequent access):
1. T+0: SolrCore starts (i.e. a replica is added, a collection is created, you name it)
2. T+0: On initialization, a component that relies on collection properties registers a listener. Solr reads from ZooKeeper once.
3. T+0 to T+end-of-life-of-the-core: The watch remains.
4. T+end-of-life-of-the-core: No more reads are expected; the watch is removed.
{quote}
Is this identical to my Case 1 if the core lives for 30 minutes?

{quote}Times read with current approach: 1 + number of modifications of properties. Times with the cache approach: unknown, at least the same as with the listener, but depends on when the properties are accessed.
{quote}
Not true. One read (or possibly zero reads, if we luck into my Case 3) on step 1 (instead of step 2 as you list it)... Properties remain cached (with updates when they change in ZK, of course) until the core is removed. This is the whole point of why I wrote ConditionalExpiringCache. While the watch is active, it will exist in collectionPropsWatches, the predicate the cache is created with will evaluate to true, and the timeout will be refreshed without removing the entry:
{code:java}
if (condition.test(peek)) {
  // we passed our stay-alive condition; refresh our time-out and put back in the queue.
  peek.expireAt = System.nanoTime() + retainForNanos;
  timeSortedEntries.add(peek);
} else {
  // we are past our time limit, and have failed our condition to stay alive, so remove from cache
  cache.remove(peek.key);
}
{code}
I expect that the typo you caught and I corrected in the latest version may have thrown you off from realizing that the cache never expires while a watch is set. In that respect my patch is backwards compatible.
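The stay-alive mechanism being described can be sketched as a minimal, single-threaded cache (the names ConditionalExpiringCacheSketch, retainForNanos, and the reap method are illustrative; the real patch's class, threading, and data structures differ):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Minimal sketch of a cache whose entries get their TTL refreshed while a condition holds. */
public class ConditionalExpiringCacheSketch<K, V> {
  private static class Entry<K> {
    final K key;
    long expireAt;
    Entry(K key, long expireAt) { this.key = key; this.expireAt = expireAt; }
  }

  private final Map<K, V> cache = new HashMap<>();
  private final Deque<Entry<K>> timeSortedEntries = new ArrayDeque<>();
  private final Predicate<K> condition; // e.g. "a watch is still registered for this key"
  private final long retainForNanos;

  public ConditionalExpiringCacheSketch(Predicate<K> condition, long retainForNanos) {
    this.condition = condition;
    this.retainForNanos = retainForNanos;
  }

  public void put(K key, V value) {
    cache.put(key, value);
    timeSortedEntries.add(new Entry<>(key, System.nanoTime() + retainForNanos));
  }

  public V get(K key) { return cache.get(key); }

  /** Reaper pass: expired entries survive (with a refreshed TTL) only if the condition holds. */
  public void reap(long now) {
    Entry<K> peek;
    while ((peek = timeSortedEntries.peek()) != null && peek.expireAt <= now) {
      timeSortedEntries.poll();
      if (condition.test(peek.key)) {
        // stay-alive condition passed: refresh the time-out and re-queue
        peek.expireAt = now + retainForNanos;
        timeSortedEntries.add(peek);
      } else {
        // past the time limit and the condition failed, so evict
        cache.remove(peek.key);
      }
    }
  }

  public static void main(String[] args) {
    java.util.Set<String> watched = new java.util.HashSet<>();
    watched.add("watched-collection");
    ConditionalExpiringCacheSketch<String, String> c =
        new ConditionalExpiringCacheSketch<>(watched::contains, 1_000_000_000L);
    c.put("watched-collection", "props-a");
    c.put("idle-collection", "props-b");
    c.reap(System.nanoTime() + 2_000_000_000L); // both entries are past their TTL
    System.out.println(c.get("watched-collection")); // kept: props-a
    System.out.println(c.get("idle-collection"));    // evicted: null
  }
}
```

This reproduces the behavior claimed in the comment: an entry with an active watch never expires, while an idle entry is removed once its TTL lapses.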
{quote}I can give you two cases:

Case 1: Collection properties are accessed infrequently (like in my "case 2 above"), but collection properties change frequently (i.e. every second)
1. T+0: call to getCollectionProperties(); ZK watch is set and element is in cache
2. T+1 to T+9: Collection properties change, firing watches to Solr. Solr receives the watch and reads from ZooKeeper
3. T+10: cache expires

With cache, we read from ZooKeeper 10 times, and ZooKeeper fires 10 watches. Without cache, we read once and ZooKeeper doesn't fire any watch. Keep in mind that some clusters may have many collections (hundreds? thousands?); this may add a lot of load to ZooKeeper for things that aren't going to be needed.
{quote}
Ah yes, I was evaluating in terms of reads caused by user action. I have assumed (I state it in one of the comments in the code, I think) that collection properties are not frequently updated. What you say is true, but it's no worse than having a watch set. I am amenable to making the expiration time tunable via configuration if you think frequent updates to collection properties for an individual collection are a realistic use case. My assumption is that they are likely to be set at initialization, and perhaps changed when an admin or autoscaling wants to adjust something. If something is thrashing collection properties, I think that's the problem and this would be a symptom.

So to summarize this case: *TLDR:* This patch does pose a risk for systems with large numbers of collections that also frequently update collection properties on many of those collections, if that system was +not already using collectionPropertiesWatches+ (or only using them on a few collections) prior to this patch (i.e. most updates to properties are ignored).
That seems like a pretty narrow use case, since it implies that the majority of updates are going un-read unless callers are making frequent calls to getCollectionProperties(), and perhaps they should have been setting a watch in the first place... This case (which I thought of but discarded as very unlikely) could be mitigated by shortening the cache expiration time and the cache eviction frequency interval here, or making them configurable:
{code:java}
// ten minutes -- possibly make this configurable
private static final long CACHE_COLLECTION_PROPS_FOR_NANOS = 10L * 60 * 1000 * 1000 * 1000;

// one minute -- possibly make this configurable
public static final int CACHE_COLLECTION_PROPS_REAPER_INTERVAL = 6;
{code}
Before we do that, however, are we really meaning to support use cases where people thrash values we store in ZooKeeper? Or is this an anti-pattern, like individual requests for updates rather than batching? My gut says the latter, but maybe I'm out in left field on that?

{quote}Case 2: A component
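One way the "make it configurable" option could look is a system-property override of the hard-coded retention (a sketch only: the property name solr.collectionPropsCacheTtlMinutes is hypothetical, not an actual Solr setting):

```java
import java.util.concurrent.TimeUnit;

/** Sketch of making the props-cache retention tunable; the property name is hypothetical. */
public class PropsCacheConfigSketch {
  // default: ten minutes, matching the hard-coded constant in the patch
  private static final long DEFAULT_TTL_MINUTES = 10L;

  static long cachePropsForNanos() {
    // allow -Dsolr.collectionPropsCacheTtlMinutes=N to shorten or lengthen retention
    long minutes = Long.getLong("solr.collectionPropsCacheTtlMinutes", DEFAULT_TTL_MINUTES);
    return TimeUnit.MINUTES.toNanos(minutes);
  }

  public static void main(String[] args) {
    System.out.println(cachePropsForNanos()); // with the default: 600000000000
  }
}
```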
[jira] [Comment Edited] (SOLR-13439) Make collection properties easier and safer to use in code
[ https://issues.apache.org/jira/browse/SOLR-13439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835757#comment-16835757 ]

Gus Heck edited comment on SOLR-13439 at 5/8/19 10:23 PM:
--

"Need" was probably the wrong word... "want" :) would have been more correct. I prefer that, since then it can be a single line and we don't have to do the cache lookup for an object that we are already handling.

As for consistency/frequency of access, let's consider some cases (time in minutes):
# Case 1:
## T+0: A watch is set; cache is populated from ZK
## T+30: watch is unregistered
## T+40: cache expires
## In this case ZooKeeper is accessed exactly once. This is UNCHANGED vs. current.
# Case 2:
## T+0: A watch is set; cache is populated from ZK
## T+15: properties are accessed with getCollectionProperties()
## T+30: the watch is unregistered
## T+40: cache expires
## In this case ZooKeeper is accessed exactly once. This is UNCHANGED vs. current.
# Case 3:
## T+0: a call to getCollectionProperties() is made; cache is populated from ZK
## T+5: a watch is set; cache is already populated, ZK is not accessed
## T+30: watch is unregistered
## T+40: cache expires
## In this case the data access frequency for ZK is once instead of twice. This is LESS than current.
# Case 4:
## T+0: a call to getCollectionProperties() is made; cache is populated from ZK
## T+10: the cache expires
## T+20: a watch is set; cache is populated from ZK
## T+30: watch is unregistered
## T+40: cache expires
## In this case ZooKeeper is accessed twice. This is UNCHANGED vs. current.
# Case 5:
## T+0: a call to getCollectionProperties() is made; cache is populated from ZK
## T+1: a call to getCollectionProperties() is made; cache is already populated, ZK is not accessed
## T+2: a call to getCollectionProperties() is made; cache is already populated, ZK is not accessed
## T+12: the cache expires
## T+30: a call to getCollectionProperties() is made; cache is populated from ZK
## T+40: the cache expires
## In this case ZooKeeper is accessed twice instead of four times. This is LESS than current.

I will grant you that it's hard to predict when the load on ZooKeeper will be less, but in no case will it be more, so unless you mean to stress test ZooKeeper this is not much of a problem. If I've missed a case, let me know.

The existing code (without this patch) isn't really consistent either. Code that calls getCollectionProperties() is either fast or slow depending on whether or not someone else has set a watch on that collection. I think the current inconsistency is worse, because something that was fast due to the action of unrelated code (getCollectionProperties()) can become surprisingly slow, whereas after my patch, setting the watch may become surprisingly fast due to the effect of unrelated code.

Edit: Re-reading that last sentence, I realize now that one could flip either one around, but I guess what I meant to express is that the delta is potentially much larger in the present code, since it would be almost inconceivable to create a watch on a per-doc basis, but one could easily write code checking collection properties on a per-doc or per-request basis.
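The read-count arithmetic in Case 5 can be checked with a toy timeline simulation (a fake minute-based clock and a ZK-read counter; in this sketch expiry is checked lazily on access rather than by a background reaper, unlike the real patch):

```java
import java.util.HashMap;
import java.util.Map;

/** Toy timeline check of Case 5: four getCollectionProperties() calls, two ZooKeeper reads. */
public class CaseFiveSketch {
  private final Map<String, Long> expiry = new HashMap<>(); // collection -> cache expiry (minutes)
  private final long ttlMinutes;
  int zkReads;

  CaseFiveSketch(long ttlMinutes) { this.ttlMinutes = ttlMinutes; }

  /** Uses the cache if still fresh, otherwise "reads from ZooKeeper" and re-caches. */
  void getCollectionProperties(String collection, long nowMinutes) {
    Long expiresAt = expiry.get(collection);
    if (expiresAt == null || expiresAt <= nowMinutes) {
      zkReads++; // cache miss: one ZooKeeper read
      expiry.put(collection, nowMinutes + ttlMinutes);
    }
  }

  public static void main(String[] args) {
    CaseFiveSketch zk = new CaseFiveSketch(10); // ten-minute TTL, as in the patch
    zk.getCollectionProperties("c", 0);  // T+0: miss, read from ZK
    zk.getCollectionProperties("c", 1);  // T+1: hit
    zk.getCollectionProperties("c", 2);  // T+2: hit
    // T+12: cache expires (detected lazily on the next access in this sketch)
    zk.getCollectionProperties("c", 30); // T+30: miss, read from ZK
    System.out.println("zk reads: " + zk.zkReads); // prints: zk reads: 2
  }
}
```

Without the cache, the same four calls would each go to ZooKeeper, matching the "twice instead of four times" claim above.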