[ https://issues.apache.org/jira/browse/SOLR-13439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16836424#comment-16836424 ]

Gus Heck edited comment on SOLR-13439 at 5/9/19 2:39 PM:
---------------------------------------------------------

{quote}Case 1 (Frequent access):
 1. T+0: SolrCore starts (i.e. A replica is added, a collection is created, you 
name it)
 2. T+0: On initialization, a component that relies on collection properties 
registers a listener. Solr reads from ZooKeeper once.
 3. T+0 to T+end-of-life-of-the-core: The watch remains.
 4. T+end-of-life-of-the-core: no more reads are expected, the watch is removed.
{quote}
This is identical to my Case 1 if the core lives for 30 minutes?
{quote}Times read with current approach: 1 + Number of modifications of 
properties.
 Times with the cache approach: unknown, at least the same as with the 
listener, but depends on when the properties are accessed.
{quote}
Not true. One read (or possibly zero, if we luck into my case 3) happens at step 1 
(instead of step 2 as you list it)... Properties remain cached (with updates 
when they change in ZK, of course) until the core is removed. This is the whole 
point of why I wrote ConditionalExpiringCache. While the watch is active, it 
will exist in collectionPropsWatches, the predicate the cache is created 
with will evaluate to true, and the timeout will be refreshed without removing 
the entry:
{code:java}
if (condition.test(peek)) {
  // we passed our stay-alive condition; refresh our time-out and put back in the queue
  peek.expireAt = System.nanoTime() + retainForNanos;
  timeSortedEntries.add(peek);
} else {
  // we are past our time limit and have failed our condition to stay alive, so remove from cache
  cache.remove(peek.key);
}
{code}
I suspect the typo you caught (and I corrected in the latest version) may have 
kept you from realizing that the cache never expires while a watch is set. In 
that respect my patch is backwards compatible.
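To make that mechanism concrete, here is a minimal, self-contained sketch of the idea only (not the patch code; the class name, constructor, and the example predicate in the comment are mine for illustration, and the real implementation keeps a time-sorted queue rather than scanning every entry):
{code:java}
// Minimal sketch of a conditional-expiring cache -- illustrative only, not the patch code.
// Entries carry a deadline; the reaper refreshes the deadline while a caller-supplied
// stay-alive predicate holds (e.g. "a collection-properties watch is still registered")
// and evicts otherwise, mirroring the if/else shown above.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

class ConditionalExpiringCacheSketch<K, V> {
  private static final class Entry<V> {
    final V value;
    volatile long expireAt;
    Entry(V value, long expireAt) { this.value = value; this.expireAt = expireAt; }
  }

  private final Map<K, Entry<V>> cache = new ConcurrentHashMap<>();
  private final long retainForNanos;
  private final Predicate<K> stayAlive; // e.g. key -> collectionPropsWatches.containsKey(key)

  ConditionalExpiringCacheSketch(long retainForNanos, Predicate<K> stayAlive) {
    this.retainForNanos = retainForNanos;
    this.stayAlive = stayAlive;
  }

  void put(K key, V value) {
    cache.put(key, new Entry<>(value, System.nanoTime() + retainForNanos));
  }

  V get(K key) {
    Entry<V> e = cache.get(key);
    return e == null ? null : e.value;
  }

  // Run periodically (e.g. every CACHE_COLLECTION_PROPS_REAPER_INTERVAL ms).
  void reap() {
    long now = System.nanoTime();
    cache.forEach((key, e) -> {
      if (e.expireAt - now > 0) {
        return;                            // not yet due
      }
      if (stayAlive.test(key)) {
        e.expireAt = now + retainForNanos; // condition holds: refresh the time-out
      } else {
        cache.remove(key);                 // due and condition failed: evict
      }
    });
  }
}
{code}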
{quote}I can give you two cases:
 Case 1: Collection properties are accessed infrequently (like in my “case 2 
above”), but collection properties change frequently (i.e. every second)
 1. T + 0: call to getCollectionProperties(), Zk watch is set and element is on 
cache
 2. T + 1 to T + 9: Collection properties changes, fires watches to Solr. Solr 
receives the watch and reads from Zookeeper
 3. T + 10 cache expires
 With cache, we read from Zookeeper 10 times, and ZooKeeper fires 10 watches. 
Without cache, we read once, ZooKeeper doesn't fire any watch. Keep in mind 
that some clusters may have many collections (hundreds/thousands?), this may 
add a lot of load to ZooKeeper for things that aren’t going to be needed.
{quote}
Ah yes, I was evaluating in terms of reads caused by user action. I have 
assumed (I state it in one of the comments in the code, I think) that collection 
properties are not frequently updated. What you say is true, but it's no worse 
than having a watch set. I am amenable to making the expiration time tunable 
via configuration if you think frequent updates to collection properties for an 
individual collection are a realistic use case. My assumption is that they are 
likely to be set at initialization, and perhaps changed when an admin or 
autoscaling wants to adjust something. If something is thrashing collection 
properties, I think that's the problem and this would be a symptom. So to 
summarize this case:

*TL;DR:* This patch does pose a risk for systems with large numbers of 
collections that also frequently update collection properties on many of those 
collections, if that system was +not already using collectionPropertiesWatches+ 
(or was only using them on a few collections) prior to this patch (i.e. most 
updates to properties are ignored).

That seems like a pretty narrow use case, since it implies that the majority of 
updates are going unread unless callers are making frequent calls to 
getCollectionProperties(), and perhaps they should have been setting a watch in 
the first place... This case (which I thought of but discarded as very unlikely) 
could be mitigated by shortening the cache expiration time and the cache 
eviction (reaper) interval here, or by making them configurable:
{code:java}
  // ten minutes -- possibly make this configurable
  private static final long CACHE_COLLECTION_PROPS_FOR_NANOS = 10L * 60 * 1000 * 1000 * 1000;

  // one minute -- possibly make this configurable
  public static final int CACHE_COLLECTION_PROPS_REAPER_INTERVAL = 60000;
{code}
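If we did go the configurable route, a minimal sketch could look like the following (the system property names here are invented for illustration and are not part of the patch; defaults match the current hard-coded values):
{code:java}
// Hypothetical sketch only: the property names below are made up for illustration.
private static final long CACHE_COLLECTION_PROPS_FOR_NANOS =
    Long.getLong("solr.collectionPropsCacheMinutes", 10L) * 60L * 1000 * 1000 * 1000;

public static final int CACHE_COLLECTION_PROPS_REAPER_INTERVAL =
    Integer.getInteger("solr.collectionPropsReaperIntervalMs", 60000);
{code}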
Before we do that, however, are we really meaning to support use cases where 
people thrash values we store in ZooKeeper? Or is this an anti-pattern, like 
individual requests for updates rather than batching? My gut says the latter, 
but maybe I'm out in left field on that?
{quote}Case 2: A component doesn’t rely on a listener, but relies on cache. 
 1. T + 0: call to getCollectionProperties(), Zk watch is set and element is on 
cache
 2. T + 10, cache expires
 3. T + 11: call to getCollectionProperties(), Zk watch is set and element is 
on cache
 4. T + 20, cache expires
 5. …
 With a listener, this is just one read. With cache, this is, again, unknown, 
but up to N, the number of calls to getCollectionProperties()
{quote}
This is identical to my case 5, and no change vs the current code, which would 
read exactly N times instead of up to N times. If the system can't handle a 
call to ZooKeeper every 10 minutes per collection (once every 0.6 seconds with 
a thousand collections), then this is the veritable straw that broke the 
camel's back, and the solution is to use more camels, or carry a lighter load, 
rather than live in mortal fear of a stray bit of straw. Again, if this is 
expected to pose a problem (again, only in systems that don't already have 
watches), increasing the timeout duration mitigates it.
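
For concreteness, the arithmetic behind that parenthetical, assuming the worst case of every collection being re-read once per cache lifetime (these are scenario numbers, not measurements):
{code:java}
// Back-of-envelope estimate, not a benchmark.
long collections = 1000;
long cacheTtlSeconds = 10 * 60;                                       // one refresh per collection per TTL
double readsPerSecond = (double) collections / cacheTtlSeconds;       // ~1.67 ZooKeeper reads/sec
double secondsBetweenReads = (double) cacheTtlSeconds / collections;  // ~0.6 s between reads
{code}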



> Make collection properties easier and safer to use in code
> ----------------------------------------------------------
>
>                 Key: SOLR-13439
>                 URL: https://issues.apache.org/jira/browse/SOLR-13439
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: master (9.0)
>            Reporter: Gus Heck
>            Assignee: Gus Heck
>            Priority: Major
>         Attachments: SOLR-13439.patch, SOLR-13439.patch
>
>
> (breaking this out from SOLR-13420, please read there for further background)
> Before this patch the api is quite confusing (IMHO):
>  # any code that wanted to know what the properties for a collection are 
> could call zkStateReader.getCollectionProperties(collection) but this was a 
> dangerous and trappy API because that was a query to zookeeper every time. If 
> a naive user auto-completed that in their IDE without investigating, heavy 
> use of zookeeper would ensue.
>  # To "do it right" for any code that might get called on a per-doc or 
> per-request basis, one had to cause caching by registering a watcher, at which 
> point getCollectionProperties(collection) magically becomes safe to use. But 
> the watcher pattern probably looks familiar and induces a user who hasn't 
> read the Solr code closely to create their own cache and update it when their 
> watcher is notified. If the caching side effect of watches isn't understood, 
> this will lead to many in-memory copies of collection properties maintained 
> in user code.
>  # This also creates a task to be scheduled on a thread (PropsNotification) 
> and induces an extra thread-scheduling lag before the changes can be observed 
> by user code.
>  # The code that cares about collection properties needs to have a lifecycle 
> tied to either a collection or some other object with an even more ephemeral 
> life cycle, such as a URP. The user now also has to remember to ensure the 
> watch is unregistered, or there is a leak.
> After this patch
>  # Calls to getCollectionProperties(collection) are always safe to use in any 
> code anywhere. Caching and cleanup are automatic.
>  # Code that really actually wants to know if a collection property changes 
> so it can wake up and do something (autoscaling?) still has the option of 
> registering a watcher that will asynchronously send them a notification.
>  # Updates can be observed sooner via getCollectionProperties with no need to 
> wait for a thread to run. (vs a cache held in user code)


