[ https://issues.apache.org/jira/browse/SOLR-13439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16836424#comment-16836424 ]
Gus Heck edited comment on SOLR-13439 at 5/9/19 2:39 PM:
---------------------------------------------------------
{quote}Case 1 (Frequent access):
1. T+0: SolrCore starts (i.e. a replica is added, a collection is created, you name it)
2. T+0: On initialization, a component that relies on collection properties registers a listener. Solr reads from ZooKeeper once.
3. T+0 to T+end-of-life-of-the-core: The watch remains.
4. T+end-of-life-of-the-core: no more reads are expected, the watch is removed.
{quote}
Isn't this identical to my Case 1 if the core lives for 30 minutes?
{quote}Times read with current approach: 1 + number of modifications of properties. Times with the cache approach: unknown, at least the same as with the listener, but depends on when the properties are accessed.
{quote}
Not true. One read (or possibly zero reads, if we luck into my case 3) on step 1 (instead of step 2 as you list it)... Properties remain cached (with updates when they change in ZK, of course) until the core is removed. This is the whole point of why I wrote ConditionalExpiringCache. While the watch is active, it will exist in collectionPropsWatches, the predicate the cache is created with will evaluate to true, and the timeout will be refreshed without removing the entry:
{code:java}
if (condition.test(peek)) {
  // we passed our stay-alive condition; refresh our time-out and put back in the queue.
  peek.expireAt = System.nanoTime() + retainForNanos;
  timeSortedEntries.add(peek);
} else {
  // we are past our time limit and have failed our condition to stay alive, so remove from cache
  cache.remove(peek.key);
}
{code}
I expect that the typo you caught (and I corrected in the latest version) may have kept you from realizing that the cache never expires while a watch is set. In that respect my patch is backwards compatible.
{quote}I can give you two cases:
Case 1: Collection properties are accessed infrequently (like in my "case 2" above), but collection properties change frequently (i.e. every second)
1. T+0: call to getCollectionProperties(), ZK watch is set and element is in the cache
2. T+1 to T+9: Collection properties change, firing watches to Solr. Solr receives the watch and reads from ZooKeeper
3. T+10: cache expires
With cache, we read from ZooKeeper 10 times, and ZooKeeper fires 10 watches. Without cache, we read once and ZooKeeper doesn't fire any watch. Keep in mind that some clusters may have many collections (hundreds? thousands?); this may add a lot of load to ZooKeeper for things that aren't going to be needed.
{quote}
Ah yes, I was evaluating in terms of reads caused by user action. I have assumed (I state it in one of the comments in the code, I think) that collection properties are not frequently updated. What you say is true, but it's no worse than having a watch set. I am amenable to making the expiration time tunable via configuration if you think frequent updates to collection properties for an individual collection are a realistic use case. My assumption is that they are likely to be set at initialization, and perhaps changed when an admin or autoscaling wants to adjust something. If something is thrashing collection properties, I think that's the problem and this would be a symptom. So to summarize this case:

*TLDR;* This patch does pose a risk for systems with large numbers of collections that also frequently update collection properties on many of those collections, if that system was +not already using collectionPropertiesWatches+ (or was only using them on a few collections) prior to this patch (i.e. most updates to properties are ignored). That seems like a pretty narrow use case, since it implies that the majority of updates are going unread unless callers are making frequent calls to getCollectionProperties(), and perhaps they should have been setting a watch in the first place...
This case (which I thought of but discarded as very unlikely) could be mitigated by shortening the cache expiration time and the cache eviction frequency interval here, or by making them configurable:
{code:java}
// ten minutes -- possibly make this configurable
private static final long CACHE_COLLECTION_PROPS_FOR_NANOS = 10L * 60 * 1000 * 1000 * 1000;
// one minute -- possibly make this configurable
public static final int CACHE_COLLECTION_PROPS_REAPER_INTERVAL = 60000;
{code}
Before we do that, however, are we really meaning to support use cases where people thrash values we store in ZooKeeper? Or is this an anti-pattern, like sending individual requests for updates rather than batching? My gut says the latter, but maybe I'm out in left field on that?
{quote}Case 2: A component doesn't rely on a listener, but relies on cache.
1. T+0: call to getCollectionProperties(), ZK watch is set and element is in the cache
2. T+10: cache expires
3. T+11: call to getCollectionProperties(), ZK watch is set and element is in the cache
4. T+20: cache expires
5. ...
With a listener, this is just one read. With cache, this is, again, unknown, but up to N, the number of calls to getCollectionProperties()
{quote}
This is identical to my case 5, and no change vs the current code, which would read exactly N times instead of up to N times. And if the system can't handle a call to ZooKeeper every 10 minutes (once every 0.6 seconds with a thousand collections), then this is the veritable straw that broke the camel's back, and the solution is to use more camels, or carry a lighter load, rather than live in mortal fear of a stray bit of straw. Again, if this is expected to pose a problem (again, only in systems that don't have watches already), increasing the timeout duration mitigates it.

> Make collection properties easier and safer to use in code
> ----------------------------------------------------------
>
>                 Key: SOLR-13439
>                 URL: https://issues.apache.org/jira/browse/SOLR-13439
>             Project: Solr
>          Issue Type: Improvement
>   Security Level: Public(Default Security Level.
Issues are Public)
>         Components: SolrCloud
>   Affects Versions: master (9.0)
>           Reporter: Gus Heck
>           Assignee: Gus Heck
>           Priority: Major
>        Attachments: SOLR-13439.patch, SOLR-13439.patch
>
> (breaking this out from SOLR-13420, please read there for further background)
> Before this patch the API is quite confusing (IMHO):
> # Any code that wanted to know what the properties for a collection are could call zkStateReader.getCollectionProperties(collection), but this was a dangerous and trappy API because it was a query to ZooKeeper every time. If a naive user auto-completed it in their IDE without investigating, heavy use of ZooKeeper would ensue.
> # To "do it right", any code that might get called on a per-document or per-request basis had to cause caching by registering a watcher, at which point getCollectionProperties(collection) magically becomes safe to use. But the watcher pattern probably looks familiar and induces a user who hasn't read the Solr code closely to create their own cache and update it when their watcher is notified. If the caching side effect of watches isn't understood, this will lead to many in-memory copies of collection properties maintained in user code.
> # This also creates a task to be scheduled on a thread (PropsNotification) and induces an extra thread-scheduling lag before the changes can be observed by user code.
> # The code that cares about collection properties needs to have a lifecycle tied to either a collection or some other object with an even more ephemeral lifecycle, such as a URP. The user now also has to remember to ensure the watch is unregistered, or there is a leak.
> After this patch:
> # Calls to getCollectionProperties(collection) are always safe to use in any code anywhere. Caching and cleanup are automatic.
> # Code that really actually wants to know when a collection property changes so it can wake up and do something (autoscaling?)
> still has the option of registering a watcher that will asynchronously send it a notification.
> # Updates can be observed sooner via getCollectionProperties(), with no need to wait for a thread to run (vs a cache held in user code).