Sowiks commented on PR #96:
URL: https://github.com/apache/otava/pull/96#issuecomment-3538076972

   > Thanks a lot @Sowiks for this! You have valuable skill in being able to 
grasp the academic level math and then still explain your findings to normal 
people with simple pictures. Btw this is why I like this tigerbeetle demo 
dataset from 2023. In 200+ points it exercises many of the phenomena you might 
encounter in this field, and so it captured your bug, or fix rather, too.
   > 
   > Amazingly I vaguely remember how this happened at MongoDB back then. I 
remember asking about this kappa and the people who had read the jameson paper 
(I would read it much later) explained that we can choose a value for it 
freely. So we did and I never thought of it again. We thought of it as a 
parameter we could choose, not that we were supposed to use all values. Since 
the by-the-book algorithm ends in a monte carlo simulation, we apparently 
accepted the fact that the reference implementation in R often produced 
different change points.
   > 
   > So it seems with your fix the algorithm will perform even better than it 
ever did. (And even now Otava has outperformed all alternatives with a good 
margin!) It now seems to hit the blind spots that always annoyed me. In a way 
Piotr's approach applying small windows kind of achieves the same behavior.
   > 
   > Do I understand correctly that running this Kappa from 0 to T is exactly 
the same as if I would start with two points, then append one point at a time 
to the timeseries, re-running otava between each step, and then keeping all 
change points found along the way? If yes, then it means that storing the 
previous results becomes the norm and we should pay more attention to a format 
and api for doing that.
   > 
   > Will review code over the weekend but from the text and pictures I can 
already tell this is good stuff. Thanks for contributing!
   
   Thank you for the flattering review :)
   
   Regarding your question:
   > Do I understand correctly that running this Kappa from 0 to T is exactly 
the same as if I would start with two points, then append one point at a time 
to the timeseries, re-running otava between each step, and then keeping all 
change points found along the way?
   
   It's kind of a loaded question, but the short answer is no. However, I think 
you'll be interested in the long answer.
   
   It's not **exactly** the same, because adding a point to the end of the series can cause a different point to become the best candidate. The minimal example I came up with is `[0, 29, 60]` with `27` appended:
   ```python
   >>> series = np.array([0, 29, 60])
   >>> calculator.get_next_candidate(slice(0, None))  # whole series
   CandidateChangePoint(index=2, qhat=41.33333333333333)  # i.e. x_2 = 60
   ```
   However,
   ```python
   >>> series = np.array([0, 29, 60, 27])
   >>> calculator.get_next_candidate(slice(0, None))  # whole series
   CandidateChangePoint(index=1, qhat=41.5)  # i.e. x_1 = 29
   ```
   Now, the intuition here is that with only three points, the jump `0 -> 29` is smaller than `29 -> 60`, so `60` has the best potential to be a change point. Once we added the 4th point, there was enough evidence that the jump `0 -> [29, 60, 27]` has the most potential. So, in general, the answer is "no". However, I'm not sure whether this can still happen as the sequence grows longer, i.e. whether a single appended point can still flip the best candidate.
   
   Next, if I understand correctly why you are asking this question, it is because of the issue described in the [Hunter paper](https://arxiv.org/abs/2301.03034), namely the claim:
   > As we began using Hunter on larger and larger data series, we discovered 
that change points identified in previous runs would suddenly disappear from 
Hunter’s results. This issue turned out to be caused by performance regressions 
that were fixed shortly after being introduced. This is a known issue with 
E-divisive means and is discussed in [5]. Because E-divisive means divides the 
time series into two parts, most of the data points on either side of the split 
showed similar values. The algorithm therefore, by design, would treat the two 
nearby changes as a temporary anomaly, rather than a persistent change, and 
therefore filter it out.

   Figure 1 illustrates this issue.
   
   <img width="2535" height="968" alt="hunter-regr-mod" src="https://github.com/user-attachments/assets/06edd83b-9733-4c83-a867-fe1ab1d3310b" />
   
   I tried to generate data similar to that in the picture and ran a few tests:
   ```python
   >>> def figure1_test(N):
   ...     base = 440 + np.random.randn(N) * 5
   ...     drop = 400 + np.random.randn(N) * 5
   ...     recover = 445 + np.random.randn(N) * 5
   ...     series = np.concatenate((base, drop, recover))
   ...     tester = PermutationsSignificanceTester(alpha=0.00001, permutations=100, calculator=PairDistanceCalculator, seed=1)
   ...     detector = ChangePointDetector(tester, PairDistanceCalculator)
   ...     points = detector.get_change_points(series)
   ...     return [p.index for p in points]
   ...
   >>> figure1_test(10)
   [10, 20]
   >>> figure1_test(100)
   [100, 200]
   >>> figure1_test(1000)  # took quite a while because of the permutation tester
   [1000, 2000]
   ```
   As you can see, (1) the change points are correctly identified for this pattern, and (2) this holds over a wide range of sequence lengths.
   

