TBH I don't imagine anyone resurrecting the Autoscaling framework; filing a JIRA issue against removed code seems like a waste of one's time. But you/Pierre can file one. It ought to be closed immediately as "Won't Fix" (I chose this resolution status recently for several now-defunct issues).

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, Jun 7, 2023 at 12:35 PM Jason Gerlowski <gerlowsk...@gmail.com> wrote:

> Thanks for sharing Pierre! It's clear you put some thought into the
> clear writeup.
>
> I wonder whether this shouldn't go into a JIRA ticket so there's a
> record of it that might be more visible to users? Would you be
> willing to summarize this in a JIRA ticket? I'm not sure if anyone
> will tackle the fix for 8x, but at least a ticket would document
> things more visibly and it opens the door to someone else tackling the
> problem down the road.
>
> Best,
>
> Jason
>
> On Thu, Jun 1, 2023 at 10:29 AM Pierre Salagnac
> <pierre.salag...@gmail.com> wrote:
> >
> > I know the autoscaling framework does not exist anymore with Solr 9+, but I
> > wanted to share here a bug we found in it.
> > Probably there are still plenty of Solr 8 users still relying on this
> > framework.
> >
> >
> > The triggers use timestamps returned by the JVM call System.nanoTime(), but
> > according to the Javadoc, this is NOT an absolute timestamp. This is just a
> > number relative to a random origin, and this origin will change each time
> > the JVM is restarted.
> >
> > I figured out this impacts at least the following triggers (with basically
> > the same pattern),
> > - IndexSizeTrigger
> > - MetricTrigger
> > - SearchRateTrigger
> >
> > These triggers want to fire an event when a certain condition (depending on
> > each trigger) is met for a certain period of time. They maintain a map with
> > [what, timestamp] entries to track a short term history, with the option to
> > remove an entry if the condition is not met anymore, so we don't trigger
> > any event.
> > Timestamps come from System.nanoTime(). So far so good as long as we
> > compare these timestamps to each others in the same JVM. Now, this map is
> > persisted in Zookeeper in case of an overseer change (written and read by
> > TriggerBase.saveState() and restoreState() ). With an overseer change, the
> > nanoTime() origin is randomly moved to something else. Consequently, all
> > the persisted timestamps from the previous overseer cannot be compared with
> > the current JVM "clock".
> > This ends in triggers never being fired, or being fired without waiting for
> > the time configured.
> >
> > I found no Jira entry for this (but maybe there is one?), and I think this
> > could be a major contributor to the instability of this framework for some
> > environments.
> > Also, I'm unsure whether it is still maintained in a 8x branch.
> >
> > Simple fix could be to always use TimeSource.getEpochTimeNs() instead
> > of getTimeNs() in autoscaling code.
> > But I'm not sure why we use nano seconds anyway. Seconds would be
> > sufficient...
> >
> > Thanks,
> > Pierre
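
For anyone reading this thread later, here is a minimal sketch of the failure mode Pierre describes. It does not use any Solr classes: the clocks are simulated, and every name in it (NanoTimeOriginSketch, monotonicNs, epochNs, waitForElapsed, the origin values) is hypothetical and chosen only for illustration. It assumes a trigger that stores a timestamp when a condition first holds and later checks whether "now minus stored" has reached the configured wait period.

```java
// Sketch only: simulated clocks, no Solr classes. All names are hypothetical.
class NanoTimeOriginSketch {
    static final long SECOND_NS = 1_000_000_000L;
    // Arbitrary epoch base (an assumption) so epoch readings look realistic.
    static final long EPOCH_BASE_NS = 1_685_000_000L * SECOND_NS;

    // Models System.nanoTime(): real elapsed time plus an origin that
    // is arbitrary and differs from one JVM instance to the next.
    static long monotonicNs(long jvmOriginNs, long realTimeNs) {
        return jvmOriginNs + realTimeNs;
    }

    // Models an epoch-based clock: the origin is the same for every JVM.
    static long epochNs(long realTimeNs) {
        return EPOCH_BASE_NS + realTimeNs;
    }

    // The "condition has held for the wait period" check on a stored timestamp.
    static boolean waitForElapsed(long storedNs, long nowNs, long waitForNs) {
        return nowNs - storedNs >= waitForNs;
    }

    public static void main(String[] args) {
        long waitFor = 60 * SECOND_NS;
        long oldOrigin = 5_000 * SECOND_NS;  // previous overseer JVM
        long lowOrigin = 100 * SECOND_NS;    // new overseer, smaller origin
        long highOrigin = 9_000 * SECOND_NS; // new overseer, larger origin

        // Timestamps persisted by the previous overseer at real time 0.
        long storedMono = monotonicNs(oldOrigin, 0);
        long storedEpoch = epochNs(0);

        long after10s = 10 * SECOND_NS;
        long after120s = 120 * SECOND_NS;

        // Larger origin: the check passes after only 10 s of real time.
        System.out.println("mono, high origin, 10s:  "
            + waitForElapsed(storedMono, monotonicNs(highOrigin, after10s), waitFor));
        // Smaller origin: the check still fails after 120 s of real time.
        System.out.println("mono, low origin, 120s:  "
            + waitForElapsed(storedMono, monotonicNs(lowOrigin, after120s), waitFor));
        // Epoch-based readings stay comparable across JVMs.
        System.out.println("epoch, 10s:  "
            + waitForElapsed(storedEpoch, epochNs(after10s), waitFor));
        System.out.println("epoch, 120s: "
            + waitForElapsed(storedEpoch, epochNs(after120s), waitFor));
    }
}
```

The two monotonic cases correspond to the two symptoms in the report: firing without waiting for the configured time, and never firing. Whether switching the autoscaling code to TimeSource.getEpochTimeNs() is sufficient in practice is not something this sketch can show; it only illustrates why timestamps with a per-JVM origin cannot be compared after being persisted and read back by another JVM.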