Sowiks commented on PR #96: URL: https://github.com/apache/otava/pull/96#issuecomment-3538076972
> Thanks a lot @Sowiks for this! You have a valuable skill in being able to grasp the academic-level math and then still explain your findings to normal people with simple pictures. Btw, this is why I like this tigerbeetle demo dataset from 2023. In 200+ points it exercises many of the phenomena you might encounter in this field, and so it captured your bug, or rather fix, too.
>
> Amazingly, I vaguely remember how this happened at MongoDB back then. I remember asking about this kappa, and the people who had read the jameson paper (I would read it much later) explained that we could choose a value for it freely. So we did, and I never thought of it again. We thought of it as a parameter we could choose, not that we were supposed to use all values. Since the by-the-book algorithm ends in a Monte Carlo simulation, we apparently accepted the fact that the reference implementation in R often produced different change points.
>
> So it seems that with your fix the algorithm will perform even better than it ever did. (And even now Otava has outperformed all alternatives by a good margin!) It now seems to hit the blind spots that always annoyed me. In a way, Piotr's approach of applying small windows kind of achieves the same behavior.
>
> Do I understand correctly that running this Kappa from 0 to T is exactly the same as if I would start with two points, then append one point at a time to the time series, re-running Otava between each step, and then keeping all change points found along the way? If yes, then it means that storing the previous results becomes the norm, and we should pay more attention to a format and API for doing that.
>
> Will review the code over the weekend, but from the text and pictures I can already tell this is good stuff. Thanks for contributing!
Thank you for the flattering review :) Regarding your question:

> Do I understand correctly that running this Kappa from 0 to T is exactly the same as if I would start with two points, then append one point at a time to the time series, re-running Otava between each step, and then keeping all change points found along the way?

It's kind of a loaded question, but the short answer is no. However, I think you'll be interested in the long answer.

It's not **exactly** the same, because adding a point to the end of the series can cause a different point to become the best candidate. The minimal example I came up with is `[0, 29, 60]` plus appending `27`:

```python
>>> series = np.array([0, 29, 60])
>>> calculator.get_next_candidate(slice(0, None))  # whole series
CandidateChangePoint(index=2, qhat=41.33333333333333)  # it's x_2 = 60
```

However,

```python
>>> series = np.array([0, 29, 60, 27])
>>> calculator.get_next_candidate(slice(0, None))  # whole series
CandidateChangePoint(index=1, qhat=41.5)  # it's x_1 = 29
```

The intuition here is that when we had only three points, the jump `0 -> 29` was smaller than `29 -> 60`, so `60` had the best potential to be a change point. Once we added the 4th point, we had enough evidence to see that the jump `0 -> [29, 60, 27]` now has the most potential.

So, the answer is "no" in general. However, I'm not sure whether this can still happen as the length of the sequence increases, i.e. whether a single appended point can still make a difference.

Next, if I understand correctly why you are asking this question, it is because of the issues described in the [Hunter paper](https://arxiv.org/abs/2301.03034). Namely, the claim:

> As we began using Hunter on larger and larger data series, we discovered that change points identified in previous runs would suddenly disappear from Hunter’s results. This issue turned out to be caused by performance regressions that were fixed shortly after being introduced.
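As an aside, the minimal example can be reproduced without Otava: the `qhat` values above are the E-divisive statistic with α = 1, i.e. m·n/(m+n) times the sample energy distance between the two sides of the split. Here is a small NumPy sketch (`qhat` and `best_candidate` are my own helper names, not Otava's API) that gives the same numbers:

```python
import numpy as np

def qhat(series: np.ndarray, tau: int) -> float:
    """E-divisive q-hat for the split series[:tau] | series[tau:], with alpha = 1."""
    x, y = series[:tau], series[tau:]
    m, n = len(x), len(y)
    between = np.abs(x[:, None] - y[None, :]).mean()  # mean |x_i - y_j| over cross pairs
    # mean |.| over distinct within-sample pairs (the full matrix counts each pair twice)
    wx = np.abs(x[:, None] - x[None, :]).sum() / (m * (m - 1)) if m > 1 else 0.0
    wy = np.abs(y[:, None] - y[None, :]).sum() / (n * (n - 1)) if n > 1 else 0.0
    return m * n / (m + n) * (2 * between - wx - wy)

def best_candidate(series: np.ndarray) -> tuple[float, int]:
    """(q-hat, split index) of the best interior split."""
    return max((qhat(series, t), t) for t in range(1, len(series)))

print(best_candidate(np.array([0, 29, 60])))      # index 2, qhat = 124/3 ≈ 41.333
print(best_candidate(np.array([0, 29, 60, 27])))  # index 1, qhat = 41.5
```

Since these match the `CandidateChangePoint` values, the flip of the best candidate from index 2 to index 1 is a property of the statistic itself, not of the implementation.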
> This is a known issue with E-divisive means and is discussed in [5]. Because E-divisive means divides the time series into two parts, most of the data points on either side of the split showed similar values. The algorithm therefore, by design, would treat the two nearby changes as a temporary anomaly rather than a persistent change, and would filter it out. Figure 1 illustrates this issue.

<img width="2535" height="968" alt="hunter-regr-mod" src="https://github.com/user-attachments/assets/06edd83b-9733-4c83-a867-fe1ab1d3310b" />

I tried to generate data similar to the one in the picture and ran a few tests:

```python
>>> def figure1_test(N):
...     base = 440 + np.random.randn(N) * 5
...     drop = 400 + np.random.randn(N) * 5
...     recover = 445 + np.random.randn(N) * 5
...     series = np.concatenate((base, drop, recover))
...     tester = PermutationsSignificanceTester(alpha=0.00001, permutations=100, calculator=PairDistanceCalculator, seed=1)
...     detector = ChangePointDetector(tester, PairDistanceCalculator)
...     points = detector.get_change_points(series)
...     return [p.index for p in points]
...
>>> figure1_test(10)
[10, 20]
>>> figure1_test(100)
[100, 200]
>>> figure1_test(1000)  # took quite a while because of the permutation tester
[1000, 2000]
```

As you can see, (1) the change points are correctly identified for this pattern, and (2) they are found across a wide range of sequence lengths.
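The same experiment can be sanity-checked without the Otava classes. Below is a plain-NumPy sketch of greedy E-divisive splitting (α = 1, no permutation significance test; `qhat` and `best_split` are my own helpers, and "find the best split, then the best split on each side" is a simplified stand-in for the real iterative search). On Figure-1-style data it recovers both the regression and the fix:

```python
import numpy as np

def qhat(series, tau):
    """E-divisive q-hat for the split series[:tau] | series[tau:], alpha = 1."""
    x, y = series[:tau], series[tau:]
    m, n = len(x), len(y)
    between = np.abs(x[:, None] - y[None, :]).mean()
    wx = np.abs(x[:, None] - x[None, :]).sum() / (m * (m - 1)) if m > 1 else 0.0
    wy = np.abs(y[:, None] - y[None, :]).sum() / (n * (n - 1)) if n > 1 else 0.0
    return m * n / (m + n) * (2 * between - wx - wy)

def best_split(segment):
    """(q-hat, index within the segment) of the strongest split; (-inf, 0) if too short."""
    if len(segment) < 2:
        return (float("-inf"), 0)
    return max((qhat(segment, t), t) for t in range(1, len(segment)))

rng = np.random.default_rng(1)
N = 30
series = np.concatenate([440 + 5 * rng.standard_normal(N),   # base
                         400 + 5 * rng.standard_normal(N),   # regression
                         445 + 5 * rng.standard_normal(N)])  # fix

_, first = best_split(series)                                # strongest single split
(ql, tl), (qr, tr) = best_split(series[:first]), best_split(series[first:])
second = tl if ql > qr else first + tr                       # stronger of the two side splits
print(sorted([first, second]))                               # change points, near [30, 60]
```

With jumps of 40–45 against noise of σ = 5, the single-split statistic locks onto one boundary and the recursion into each side recovers the other, matching the `figure1_test` results above.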
