henrikingo commented on code in PR #96: URL: https://github.com/apache/otava/pull/96#discussion_r2524996795
########## otava/change_point_divisive/base.py: ########## @@ -0,0 +1,109 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +from dataclasses import dataclass, fields +from typing import Generic, List, Optional, TypeVar + +from numpy.typing import NDArray + + +@dataclass +class CandidateChangePoint: + '''Candidate for a change point. The point that maximizes Q-hat function on [start:end+1] slice''' + index: int Review Comment: Maybe it would be clearer if somewhere you also use either: start < index <= end or the equivalent (start, end], and then say "...which corresponds to the slice [start:end+1] in Python, as well as range(start, end+1)". Could add a mention that indexes are always the original index from the full series [0,T] and not zero-based for each interval (`start < index <= end`, never `0 < index <= end-start`). A change point at index 0 is impossible, because a single point just cannot change. But this IMO follows logically, it is not a pre-condition. (There's a fun philosophical debate here: where exactly are the change points? In the case of a series of git commits, it is clear that the change point is the commit that causes the regression/improvement.
Others could argue change is what happens in the gaps between the points of measurement.) In Otava we index change points from 1 to T, because this way they match with their corresponding test result in the input series. So you could make the conclusion Otava is in the camp that change happens, or is observed at least, at the point after the change. Anyway, should we add an invariant here that index > 0? ########## otava/analysis.py: ########## @@ -170,10 +126,42 @@ def compare(self, left: np.ndarray, right: np.ndarray) -> ComparativeStats: ) else: p = 1.0 - return ComparativeStats(mean_l, mean_r, std_l, std_r, p) + return TTestStats(mean_1=mean_l, mean_2=mean_r, std_1=std_l, std_2=std_r, pvalue=p) + + def change_point( + self, candidate: CandidateChangePoint, series: Sequence[SupportsFloat], intervals: List[slice] + ) -> ChangePoint[TTestStats]: + """ + Computes properties of the change point if the Candidate Change Point based on the provided intervals. + + The method works for two cases: + 1. Divisive algorithm. + if the candidate is a new potential change point, i.e., its index is inside any interval, then + we split the interval by the candidate's index to get left and right subseries. + 2. Merge step in t-test algorithm. Review Comment: Terminology: I would say any variant of the algorithm can use a T-Test, Permutation test, or something else. The Merge step is part of what I'd call split-merge strategy, or perhaps weak change points version. I guess the Hunter paper just called this the Fixed size windows, but I could think of many kinds of windows (such as sliding) that don't need to be merged. And this is at least closely related to the weak change points, but I'm unsure whether weak change points will be needed now? Otoh maybe we'll soon get rid of the split-merge too, in which case we can ignore this discussion. 
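The `index > 0` invariant raised in the `CandidateChangePoint` discussion above could be sketched as a `__post_init__` check. This is a hypothetical variant for illustration, not the actual PR code; field names mirror the dataclass from the diff:

```python
from dataclasses import dataclass


@dataclass
class CandidateChangePoint:
    """Candidate change point at `index`, where start < index <= end.

    `index` is always an index into the full series [0, T],
    never zero-based within the current interval.
    """
    index: int
    qhat: float

    def __post_init__(self):
        # A change point at index 0 is impossible: a single point cannot change.
        assert self.index > 0, f"index must be > 0, got {self.index}"
```

With this check, constructing a candidate at index 0 fails immediately instead of surfacing as an off-by-one error deeper in the divisive loop.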
########## otava/analysis.py: ########## @@ -259,13 +247,13 @@ def split(series: np.array, window_len: int = 30, max_pvalue: float = 0.001, assert "Window length must be at least 2", window_len >= 2 start = 0 step = int(window_len / 2) - indexes = [] + change_points = [] # N new_points are appended to the end of series. Typically N=1. # old_cp are the weak change points from before new points were added. # We now just compute e-e_divisive for the tail of the series, beginning at # max(old_cp[-1], a step that is over 2 window_len from the end) if new_points is not None and old_cp is not None: - indexes = [c.index for c in old_cp] + change_points = old_cp[:] Review Comment: I like this change (throughout) ########## otava/analysis.py: @@ -222,22 +210,22 @@ def merge( # We can't continue yet, because by removing a change_point # the adjacent change points changed their properties. # Recompute the adjacent change point stats: - window_endpoints = [0] + [cp.index for cp in change_points] + [len(series)] + intervals = tester.get_intervals(change_points) def recompute(index: int): if index < 0 or index >= len(change_points): return cp = change_points[index] - change_points[index] = tester.change_point(cp.index, series, window_endpoints) + change_points[index] = tester.change_point(cp.to_candidate(), series, intervals) recompute(weakest_cp_index) recompute(weakest_cp_index + 1) Review Comment: Do we test for the case where weakest_cp_index == max(index)? ########## otava/change_point_divisive/calculator.py: @@ -0,0 +1,189 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License.
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + + +import numpy as np +from numpy.typing import NDArray + +from otava.change_point_divisive.base import Calculator, CandidateChangePoint + + +class PairDistanceCalculator(Calculator): + def __init__(self, series: NDArray, power: int = 1): + super().__init__(series) + assert 0 < power < 2, f"power={power} isn't in (0, 2)" + self.power = power + self.V = None + self.H = None + + def _calculate_distances(self): + '''Precomputes `H` and `V` functions that are used for computation of `Q` function. + See more details about `Q` function in `get_candidate_change_point` method. + See more details about the use of `H` and `V` functions in `_get_A_B_C` method. + + Let matix `distances` be a matrix of all possible L1 (Manhattan) distances between Review Comment: matrix ########## otava/change_point_divisive/calculator.py: ########## @@ -0,0 +1,189 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. 
See the License for the +# specific language governing permissions and limitations +# under the License. + + +import numpy as np +from numpy.typing import NDArray + +from otava.change_point_divisive.base import Calculator, CandidateChangePoint + + +class PairDistanceCalculator(Calculator): + def __init__(self, series: NDArray, power: int = 1): + super().__init__(series) + assert 0 < power < 2, f"power={power} isn't in (0, 2)" + self.power = power + self.V = None + self.H = None + + def _calculate_distances(self): + '''Precomputes `H` and `V` functions that are used for computation of `Q` function. + See more details about `Q` function in `get_candidate_change_point` method. + See more details about the use of `H` and `V` functions in `_get_A_B_C` method. + + Let matix `distances` be a matrix of all possible L1 (Manhattan) distances between + elements of 1d array `series`. If `D(i, j) = |series[i] - series[j]|` and `N = len(series)`, + then + + | D(0, 0) D(0, 1) … D(0, N - 1) | + | D(1, 0) D(1, 1) … D(1, N - 1) | + distances = | ⋮ ⋮ ⋱ ⋮ | . + | D(N - 1, 0) D(N - 1, 1) … D(N - 1, N - 1) | + + Note that the main diagonal of the matrix is filled with 0. + + + We define functions `V` and `H` as the following (V)ertical and (H)orizontal + cummulative sums: + V(start, τ, κ) = Σ distances[start:τ, τ], + H(start, τ, κ) = Σ distances[τ, τ:κ], + for `start < τ < κ <= N`. + Note that `V(start, τ, κ) = V(start, τ)` and `H(start, τ, κ) = H(τ, κ)`. + + + Vector V contains the following values of function `V(start, τ, κ)`: + V = [V(0, 1, κ) V(0, 2, κ), … V(0, N - 1, κ)]. + `V(0, 0, κ) = 0` so they are ommited. As noted, function `V(start, τ, κ)` + does not depend on `κ`. Therefore, vector V contains all values of function + `V(0, τ, κ)`. We can easily get values of function `V` for an arbitrary + value of `start` by + V(start, τ, κ) = V(0, τ, κ) - Σ distances[0 : start, τ]. + Note: The reason not all values of `V(start, τ, κ)` are precomputed is that + we do not need them all. 
The values of `start` will depend on critical + points we find. Precomuting them for all possible values is a waste. + + + Matrix H contains the following values of function `H(start, τ, κ)`: + + | H(start, 0, 2) H(start, 0, 3) … H(start, 0, N) | + | 0 H(start, 1, 3) … H(start, 1, N) | + H = | ⋮ ⋮ ⋱ ⋮ | . + | 0 0 … H(start, N - 2, N) | + + `H(start, x, x) = H(start, x - 1, x) = 0` for any `x` so they are ommited. + As noted, function `H(start, τ, κ)` does not depend on `start`. Therefore, + matrix H contains all possible values of function `H`. + Note: We precomputed all values of `H(start, τ, κ)` because all of them are needed + for the very first iteration (`start=0` and `end=N-1`).''' + self.distances = np.power(np.abs(self.series[:, None] - self.series[None, :]), self.power) + triu = np.triu(self.distances, k=1)[:-1, 1:] + self.V = triu.sum(axis=0) + self.H = triu.cumsum(axis=1) + + def _get_Q_vals(self, start: int, end: int) -> NDArray: + '''Computes matrices A, B, C where all possible values of function `Q` are + given by matrix Q = A - B - C. + See more details about `Q` function in `get_candidate_change_point` method. + + For any given `start` and `end` let + Q(τ, κ) = A(start, τ, κ) - B(start, τ, κ) - C(start, τ, κ), + where `start < τ < κ <= end + 1`. (For definitions of `A`, `B`, `C` see + formula of `Q` in `get_candidate_change_point` method.) All possible values of function + `A` are given by matrix + + | A(start, start + 1, start + 2) A(start, start + 1, start + 3) … A(start, start + 1, end + 1) | + | 0 A(start, start + 2, start + 3) … A(start, start + 2, end + 1) | + A = | ⋮ ⋮ ⋱ ⋮ | . + | 0 0 … A(start, end, end + 1) | + + Matrices B and C follow the same index-to-value pattern. + + Matrices A, B, and C are used to compute values of function `Q` recursively, without + recomputing the same sums over and over again. 
The recursive formulas were further + transformed to closed forms, so they can be computed using numpy cummulative sum + function to take advantage of numpy vectorized operations. (Note that each column + in matrices A, B, C can be repersented through np.cumsum). The formulas use + commulative sum of functions `H` and `V`, which definitions can be found in + `_calculate_distances` method. + + A(start, τ, κ) = 2 / (κ - s) * (Σt=start,τ-1 H(start, t, κ) - Σt=start+1,τ-1 V(start, t, κ)), + + B(start, τ, κ) = 2 * (κ - τ) / (κ - start) / (τ - start - 1) * Σt=start+1,τ-1 V(start, t, κ), + + C(start, τ, κ) = 2 * (τ - start) / (κ - start) / (κ - τ - 1) * Σt=τ,κ-2 H(start, t, κ).''' + if self.V is None or self.H is None: + self._calculate_distances() + + V = self.V[start : end] - self.distances[0 : start, start + 1 : end + 1].sum(axis=0) + H = self.H[start : end, start : end] + + taus = np.arange(start + 1, end + 1)[:, None] + kappas = np.arange(start + 2, end + 2)[None, :] + + A = np.zeros((end - start, end - start)) + A_coefs = 2 / (kappas - start) + A[1:, :] = np.cumsum(V)[:-1, None] + A = A_coefs * np.triu(np.cumsum(H, axis=0) - A, k=0) + + B = np.zeros((end - start, end - start)) + B_num = 2 * (kappas - taus) + B_den = (kappas - start) * (taus - start - 1) + B_mask = np.triu(np.ones_like(B_den, dtype=bool), k=0) + B_out = np.zeros_like(B_den, dtype=float) + B_coefs = np.divide(B_num, B_den, out=B_out, where=B_mask & (B_den != 0)) + B[1:, 1:] = B_coefs[1:, 1:] * np.cumsum(V)[:-1, None] + + C = np.zeros((end - start, end - start)) + C_num = 2 * (taus - start) + C_den = (kappas - start) * (kappas - taus - 1) + C_mask = np.triu(np.ones_like(C_den, dtype=bool), k=1) + C_out = np.zeros_like(C_den, dtype=float) + C_coefs = np.divide(C_num, C_den, out=C_out, where=C_mask & (C_den != 0)) + C[:-1, 1:] = C_coefs[:-1, 1:] * np.flipud(np.cumsum(np.flipud(H[1:, 1:]), axis=0)) + + # Element of matrix `Q_{i, j}` is equal to `Q(τ, κ) = Q(i + 1, j + 2) = QQ(sequence[start : i + 1], 
sequence[i + 1 : j + 2])`. + # So, critical point is `τ = i + 1`. + return A - B - C + + def get_candidate_change_point(self, interval: slice) -> CandidateChangePoint: + '''For a given `slice(start, stop)` finds potential critical point in subsequence series[slice], + i.e., from index `start` to `stop - 1` inclusive. For simplicity, we'll use `end = stop - 1`. + + + The potential critical point is defined as the following. + Suppose for given `start` and `end` the pair `(τ, κ)` maximizes function + Q(τ, κ) = QQ(series[start : τ], series[τ : κ]), + where `τ` and `κ` are such that both `series[start : τ]` and `series[τ : κ]` are non-empty. + Then `τ` is a potential critical point (i.e., the first element of `series[τ : κ]`). + Note: for `series[start:τ]` and `series[τ:κ]` to be non-empty, we have + start < τ < κ <= end + 1 = len(series). + Note: start < τ => sequence[0] cannot be a critical point. + + + Function `QQ` is defined as the following: + Let `Z` be a series `[Z_{start}, Z_{start + 1}, ..., Z_{end}]`, and `X` and `Y` + be a split of `Z`, i.e., + X = [Z_{start}, Z_{start + 1}, ..., Z_a] = [X_1, X_2, ..., X_n], + Y = [Z_{a + 1}, Z_{a + 2}, ..., Z_b] = [Y_1, Y_2, ..., Y_m], + then + QQ(X, Y) = 2 / (m + n) * Σi=1,n Σj=1,m |X_i - Y_j| + - 2 * m / (m + n) / (n - 1) * Σ1<=i<k<=n |X_i - X_k| + - 2 * n / (m + n) / (m - 1) * Σ1<=j<k<=m |Y_j - Y_k|. + + Note: QQ(X, Y) = QQ(Z[start : a + 1], Z[a + 1 : b + 1]) = Q(a + 1, b + 1). 
+ + + This method computes all values of `Q` for given `start` and `end`, and returns index + of the potential critical point and the maximized value of `Q`.''' + + start = 0 if interval.start is None else interval.start + end = len(self.series) - 1 if interval.stop is None else interval.stop - 1 + assert start < end, f"start={start} must be less than end={end}" + + Q = self._get_Q_vals(start, end) + i, j = np.unravel_index(np.argmax(Q), Q.shape) + return CandidateChangePoint(index=i + 1 + start, qhat=Q[i][j]) Review Comment: Also this feels reassuringly familiar <3 ########## otava/change_point_divisive/calculator.py: ########## @@ -0,0 +1,189 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + + +import numpy as np +from numpy.typing import NDArray + +from otava.change_point_divisive.base import Calculator, CandidateChangePoint + + +class PairDistanceCalculator(Calculator): + def __init__(self, series: NDArray, power: int = 1): + super().__init__(series) + assert 0 < power < 2, f"power={power} isn't in (0, 2)" + self.power = power + self.V = None + self.H = None + + def _calculate_distances(self): + '''Precomputes `H` and `V` functions that are used for computation of `Q` function. 
+ See more details about `Q` function in `get_candidate_change_point` method. + See more details about the use of `H` and `V` functions in `_get_A_B_C` method. + + Let matix `distances` be a matrix of all possible L1 (Manhattan) distances between + elements of 1d array `series`. If `D(i, j) = |series[i] - series[j]|` and `N = len(series)`, + then + + | D(0, 0) D(0, 1) … D(0, N - 1) | + | D(1, 0) D(1, 1) … D(1, N - 1) | + distances = | ⋮ ⋮ ⋱ ⋮ | . + | D(N - 1, 0) D(N - 1, 1) … D(N - 1, N - 1) | + + Note that the main diagonal of the matrix is filled with 0. + + + We define functions `V` and `H` as the following (V)ertical and (H)orizontal + cummulative sums: + V(start, τ, κ) = Σ distances[start:τ, τ], + H(start, τ, κ) = Σ distances[τ, τ:κ], + for `start < τ < κ <= N`. + Note that `V(start, τ, κ) = V(start, τ)` and `H(start, τ, κ) = H(τ, κ)`. + + + Vector V contains the following values of function `V(start, τ, κ)`: + V = [V(0, 1, κ) V(0, 2, κ), … V(0, N - 1, κ)]. + `V(0, 0, κ) = 0` so they are ommited. As noted, function `V(start, τ, κ)` + does not depend on `κ`. Therefore, vector V contains all values of function + `V(0, τ, κ)`. We can easily get values of function `V` for an arbitrary + value of `start` by + V(start, τ, κ) = V(0, τ, κ) - Σ distances[0 : start, τ]. + Note: The reason not all values of `V(start, τ, κ)` are precomputed is that + we do not need them all. The values of `start` will depend on critical + points we find. Precomuting them for all possible values is a waste. + + + Matrix H contains the following values of function `H(start, τ, κ)`: + + | H(start, 0, 2) H(start, 0, 3) … H(start, 0, N) | + | 0 H(start, 1, 3) … H(start, 1, N) | + H = | ⋮ ⋮ ⋱ ⋮ | . + | 0 0 … H(start, N - 2, N) | + + `H(start, x, x) = H(start, x - 1, x) = 0` for any `x` so they are ommited. + As noted, function `H(start, τ, κ)` does not depend on `start`. Therefore, + matrix H contains all possible values of function `H`. 
+ Note: We precomputed all values of `H(start, τ, κ)` because all of them are needed + for the very first iteration (`start=0` and `end=N-1`).''' + self.distances = np.power(np.abs(self.series[:, None] - self.series[None, :]), self.power) + triu = np.triu(self.distances, k=1)[:-1, 1:] + self.V = triu.sum(axis=0) + self.H = triu.cumsum(axis=1) + + def _get_Q_vals(self, start: int, end: int) -> NDArray: + '''Computes matrices A, B, C where all possible values of function `Q` are + given by matrix Q = A - B - C. + See more details about `Q` function in `get_candidate_change_point` method. + + For any given `start` and `end` let + Q(τ, κ) = A(start, τ, κ) - B(start, τ, κ) - C(start, τ, κ), + where `start < τ < κ <= end + 1`. (For definitions of `A`, `B`, `C` see + formula of `Q` in `get_candidate_change_point` method.) All possible values of function + `A` are given by matrix + + | A(start, start + 1, start + 2) A(start, start + 1, start + 3) … A(start, start + 1, end + 1) | + | 0 A(start, start + 2, start + 3) … A(start, start + 2, end + 1) | + A = | ⋮ ⋮ ⋱ ⋮ | . + | 0 0 … A(start, end, end + 1) | + + Matrices B and C follow the same index-to-value pattern. + + Matrices A, B, and C are used to compute values of function `Q` recursively, without + recomputing the same sums over and over again. The recursive formulas were further + transformed to closed forms, so they can be computed using numpy cummulative sum + function to take advantage of numpy vectorized operations. (Note that each column + in matrices A, B, C can be repersented through np.cumsum). The formulas use + commulative sum of functions `H` and `V`, which definitions can be found in + `_calculate_distances` method. 
+ + A(start, τ, κ) = 2 / (κ - s) * (Σt=start,τ-1 H(start, t, κ) - Σt=start+1,τ-1 V(start, t, κ)), + + B(start, τ, κ) = 2 * (κ - τ) / (κ - start) / (τ - start - 1) * Σt=start+1,τ-1 V(start, t, κ), + + C(start, τ, κ) = 2 * (τ - start) / (κ - start) / (κ - τ - 1) * Σt=τ,κ-2 H(start, t, κ).''' + if self.V is None or self.H is None: + self._calculate_distances() + + V = self.V[start : end] - self.distances[0 : start, start + 1 : end + 1].sum(axis=0) + H = self.H[start : end, start : end] + + taus = np.arange(start + 1, end + 1)[:, None] + kappas = np.arange(start + 2, end + 2)[None, :] + + A = np.zeros((end - start, end - start)) + A_coefs = 2 / (kappas - start) + A[1:, :] = np.cumsum(V)[:-1, None] + A = A_coefs * np.triu(np.cumsum(H, axis=0) - A, k=0) + + B = np.zeros((end - start, end - start)) + B_num = 2 * (kappas - taus) + B_den = (kappas - start) * (taus - start - 1) + B_mask = np.triu(np.ones_like(B_den, dtype=bool), k=0) + B_out = np.zeros_like(B_den, dtype=float) + B_coefs = np.divide(B_num, B_den, out=B_out, where=B_mask & (B_den != 0)) + B[1:, 1:] = B_coefs[1:, 1:] * np.cumsum(V)[:-1, None] + + C = np.zeros((end - start, end - start)) + C_num = 2 * (taus - start) + C_den = (kappas - start) * (kappas - taus - 1) + C_mask = np.triu(np.ones_like(C_den, dtype=bool), k=1) + C_out = np.zeros_like(C_den, dtype=float) + C_coefs = np.divide(C_num, C_den, out=C_out, where=C_mask & (C_den != 0)) + C[:-1, 1:] = C_coefs[:-1, 1:] * np.flipud(np.cumsum(np.flipud(H[1:, 1:]), axis=0)) + + # Element of matrix `Q_{i, j}` is equal to `Q(τ, κ) = Q(i + 1, j + 2) = QQ(sequence[start : i + 1], sequence[i + 1 : j + 2])`. + # So, critical point is `τ = i + 1`. + return A - B - C + + def get_candidate_change_point(self, interval: slice) -> CandidateChangePoint: + '''For a given `slice(start, stop)` finds potential critical point in subsequence series[slice], + i.e., from index `start` to `stop - 1` inclusive. For simplicity, we'll use `end = stop - 1`. 
+ + + The potential critical point is defined as the following. + Suppose for given `start` and `end` the pair `(τ, κ)` maximizes function + Q(τ, κ) = QQ(series[start : τ], series[τ : κ]), + where `τ` and `κ` are such that both `series[start : τ]` and `series[τ : κ]` are non-empty. + Then `τ` is a potential critical point (i.e., the first element of `series[τ : κ]`). + Note: for `series[start:τ]` and `series[τ:κ]` to be non-empty, we have + start < τ < κ <= end + 1 = len(series). + Note: start < τ => sequence[0] cannot be a critical point. + Review Comment: That's more like it! I'll leave all my other commentary for philosophical value. But please review: I think earlier, where you end up having +2 here and there, you've done a +1 once too many somewhere. (The result can still be correct, but readability suffers.) ########## otava/analysis.py: @@ -278,35 +266,31 @@ def split(series: np.array, window_len: int = 30, max_pvalue: float = 0.001, tester = TTestSignificanceTester(max_pvalue) while start < len(series): end = min(start + window_len, len(series)) - calculator = cext_calculator - algo = EDivisive(seed=None, calculator=calculator, significance_tester=tester) - pts = algo.get_change_points(series[start:end]) - new_indexes = [p.index + start for p in pts] - new_indexes.sort() - last_new_change_point_index = next(iter(new_indexes[-1:]), 0) + algo = ChangePointDetector(significance_tester=tester, calculator=PairDistanceCalculator) + new_change_points = algo.get_change_points(series, start, end) + last_new_change_point_index = new_change_points[-1].index if new_change_points else 0 start = max(last_new_change_point_index, start + step) # incremental Otava can duplicate an old cp - for i in new_indexes: - if i not in indexes: - indexes += [i] + for cp in new_change_points: + if cp not in change_points: + change_points += [cp] - window_endpoints = [0] + indexes + [len(series)] - return [tester.change_point(i, series, window_endpoints) for i in indexes] + intervals =
tester.get_intervals(change_points) + return [tester.change_point(cp.to_candidate(), series, intervals) for cp in change_points] -def compute_change_points_orig(series: np.array, max_pvalue: float = 0.001) -> List[ChangePoint]: - calculator = cext_calculator - tester = QHatPermutationsSignificanceTester(calculator, pvalue=max_pvalue, permutations=100) - algo = EDivisive(seed=None, calculator=calculator, significance_tester=tester) - pts = algo.get_change_points(series) - return pts, None +def compute_change_points_orig(series: Sequence[SupportsFloat], max_pvalue: float = 0.001, seed: Optional[int] = None) -> Tuple[PermCPList, Optional[PermCPList]]: + tester = PermutationsSignificanceTester(alpha=max_pvalue, permurations=100, calculator=PairDistanceCalculator, seed=seed) Review Comment: permutations ########## otava/analysis.py: @@ -259,13 +247,13 @@ def split(series: np.array, window_len: int = 30, max_pvalue: float = 0.001, assert "Window length must be at least 2", window_len >= 2 start = 0 step = int(window_len / 2) - indexes = [] + change_points = [] Review Comment: Perhaps update the above comment, and check other comments too. (grep mongodb and signal) ########## otava/analysis.py: @@ -170,10 +126,42 @@ def compare(self, left: np.ndarray, right: np.ndarray) -> ComparativeStats: ) else: p = 1.0 - return ComparativeStats(mean_l, mean_r, std_l, std_r, p) + return TTestStats(mean_1=mean_l, mean_2=mean_r, std_1=std_l, std_2=std_r, pvalue=p) + + def change_point( + self, candidate: CandidateChangePoint, series: Sequence[SupportsFloat], intervals: List[slice] + ) -> ChangePoint[TTestStats]: + """ + Computes properties of the change point if the Candidate Change Point based on the provided intervals. + + The method works for two cases: + 1. Divisive algorithm. + if the candidate is a new potential change point, i.e., its index is inside any interval, then + we split the interval by the candidate's index to get left and right subseries. + 2.
Merge step in t-test algorithm. + if the candidate is an existing change point, i.e., it matches the end of two intervals, then + it's a potential weak change point, and we don't need to split the intervals anymore (just take + both intervals as left and right subseries). + + """ + for i, interval in enumerate(intervals): + if interval.stop == candidate.index: + left_interval = interval + right_interval = intervals[i + 1] + break + elif (interval.start is None or interval.start < candidate.index) and (interval.stop is None or candidate.index < interval.stop): Review Comment: How can interval start ever be None? If the end points aren't known I'd say it's not an interval at all? ########## otava/analysis.py: @@ -278,35 +266,31 @@ def split(series: np.array, window_len: int = 30, max_pvalue: float = 0.001, tester = TTestSignificanceTester(max_pvalue) while start < len(series): end = min(start + window_len, len(series)) - calculator = cext_calculator - algo = EDivisive(seed=None, calculator=calculator, significance_tester=tester) - pts = algo.get_change_points(series[start:end]) - new_indexes = [p.index + start for p in pts] - new_indexes.sort() - last_new_change_point_index = next(iter(new_indexes[-1:]), 0) + algo = ChangePointDetector(significance_tester=tester, calculator=PairDistanceCalculator) Review Comment: ChangePointDetector is the name of the entire algorithm? I.e., is it the more generic alternative to e-divisive? ########## otava/change_point_divisive/calculator.py: @@ -0,0 +1,189 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License.
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + + +import numpy as np +from numpy.typing import NDArray + +from otava.change_point_divisive.base import Calculator, CandidateChangePoint + + +class PairDistanceCalculator(Calculator): + def __init__(self, series: NDArray, power: int = 1): + super().__init__(series) + assert 0 < power < 2, f"power={power} isn't in (0, 2)" Review Comment: That's a strong assertion. Why use this parameter at all? I think we only ever used 1. But 2 could be a way to de-emphasize small changes? ########## otava/change_point_divisive/base.py: ########## @@ -0,0 +1,109 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +from dataclasses import dataclass, fields +from typing import Generic, List, Optional, TypeVar + +from numpy.typing import NDArray + + +@dataclass +class CandidateChangePoint: + '''Candidate for a change point. 
The point that maximizes Q-hat function on [start:end+1] slice''' + index: int + qhat: float + + +@dataclass +class BaseStats: + '''Abstract statistics class for change point. Implementation depends on the statistical test.''' + pvalue: float + + +GenericStats = TypeVar("GenericStats", bound=BaseStats) + + +@dataclass +class ChangePoint(CandidateChangePoint, Generic[GenericStats]): + '''Change point class, defined by index and signigicance test statistic.''' + stats: GenericStats + + def __eq__(self, other): + '''Helpful to identify new Change Points during divisive algorithm''' + return isinstance(other, self.__class__) and self.index == other.index + + @classmethod + def from_candidate(cls, candidate: CandidateChangePoint, stats: GenericStats) -> 'ChangePoint[GenericStats]': + return cls( + index=candidate.index, + qhat=candidate.qhat, + stats=stats, + ) + + def to_candidate(self) -> CandidateChangePoint: + '''Downgrades Change Point to a Candidate Change Point. Used to recompute stats for Weak Change Points.''' + data = {f.name: getattr(self, f.name) for f in fields(CandidateChangePoint)} + return CandidateChangePoint(**data) + + +class SignificanceTester(Generic[GenericStats]): + '''Abstract class for significance tester''' + + def __init__(self, alpha: float): + self.alpha = alpha Review Comment: I know what p-value is but what is alpha? ########## otava/change_point_divisive/base.py: ########## @@ -0,0 +1,109 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +from dataclasses import dataclass, fields +from typing import Generic, List, Optional, TypeVar + +from numpy.typing import NDArray + + +@dataclass +class CandidateChangePoint: + '''Candidate for a change point. The point that maximizes Q-hat function on [start:end+1] slice''' + index: int + qhat: float + + +@dataclass +class BaseStats: + '''Abstract statistics class for change point. Implementation depends on the statistical test.''' + pvalue: float Review Comment: I would argue we always want mean on the left and right side and stddev too. Even if not used by the significance test, these are typically reported back to users, so this is a good opportunity to standardize that they are always here. Essentially the change in a change point is mean_right - mean_left, except for the interesting case where that is zero yet it's still a change point :-) ########## otava/change_point_divisive/calculator.py: ########## @@ -0,0 +1,189 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + + +import numpy as np +from numpy.typing import NDArray + +from otava.change_point_divisive.base import Calculator, CandidateChangePoint + + +class PairDistanceCalculator(Calculator): + def __init__(self, series: NDArray, power: int = 1): + super().__init__(series) + assert 0 < power < 2, f"power={power} isn't in (0, 2)" + self.power = power + self.V = None + self.H = None + + def _calculate_distances(self): Review Comment: I don't mind this method name, but when explaining to people, I find that "pairwise differences" is layman-understandable. ########## otava/analysis.py: ########## @@ -259,13 +247,13 @@ def split(series: np.array, window_len: int = 30, max_pvalue: float = 0.001, assert "Window length must be at least 2", window_len >= 2 Review Comment: This assert is the wrong way >>> assert True, False >>> assert False, True Traceback (most recent call last): File "<stdin>", line 1, in <module> AssertionError: True >>> ########## otava/change_point_divisive/calculator.py: ########## @@ -0,0 +1,189 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + + +import numpy as np +from numpy.typing import NDArray + +from otava.change_point_divisive.base import Calculator, CandidateChangePoint + + +class PairDistanceCalculator(Calculator): + def __init__(self, series: NDArray, power: int = 1): + super().__init__(series) + assert 0 < power < 2, f"power={power} isn't in (0, 2)" + self.power = power + self.V = None + self.H = None + + def _calculate_distances(self): + '''Precomputes `H` and `V` functions that are used for computation of `Q` function. + See more details about `Q` function in `get_candidate_change_point` method. + See more details about the use of `H` and `V` functions in `_get_A_B_C` method. + + Let matrix `distances` be a matrix of all possible L1 (Manhattan) distances between + elements of 1d array `series`. If `D(i, j) = |series[i] - series[j]|` and `N = len(series)`, + then + + | D(0, 0) D(0, 1) … D(0, N - 1) | + | D(1, 0) D(1, 1) … D(1, N - 1) | + distances = | ⋮ ⋮ ⋱ ⋮ | . + | D(N - 1, 0) D(N - 1, 1) … D(N - 1, N - 1) | + + Note that the main diagonal of the matrix is filled with 0. + + + We define functions `V` and `H` as the following (V)ertical and (H)orizontal + cumulative sums: + V(start, τ, κ) = Σ distances[start:τ, τ], + H(start, τ, κ) = Σ distances[τ, τ:κ], + for `start < τ < κ <= N`. + Note that `V(start, τ, κ) = V(start, τ)` and `H(start, τ, κ) = H(τ, κ)`. + + + Vector V contains the following values of function `V(start, τ, κ)`: + V = [V(0, 1, κ) V(0, 2, κ), … V(0, N - 1, κ)]. + `V(0, 0, κ) = 0` so they are omitted. As noted, function `V(start, τ, κ)` + does not depend on `κ`.
Therefore, vector V contains all values of function + `V(0, τ, κ)`. We can easily get values of function `V` for an arbitrary + value of `start` by + V(start, τ, κ) = V(0, τ, κ) - Σ distances[0 : start, τ]. + Note: The reason not all values of `V(start, τ, κ)` are precomputed is that + we do not need them all. The values of `start` will depend on critical + points we find. Precomputing them for all possible values is a waste. + + + Matrix H contains the following values of function `H(start, τ, κ)`: + + | H(start, 0, 2) H(start, 0, 3) … H(start, 0, N) | + | 0 H(start, 1, 3) … H(start, 1, N) | + H = | ⋮ ⋮ ⋱ ⋮ | . + | 0 0 … H(start, N - 2, N) | + + `H(start, x, x) = H(start, x - 1, x) = 0` for any `x` so they are omitted. + As noted, function `H(start, τ, κ)` does not depend on `start`. Therefore, + matrix H contains all possible values of function `H`. + Note: We precomputed all values of `H(start, τ, κ)` because all of them are needed + for the very first iteration (`start=0` and `end=N-1`).''' + self.distances = np.power(np.abs(self.series[:, None] - self.series[None, :]), self.power) + triu = np.triu(self.distances, k=1)[:-1, 1:] + self.V = triu.sum(axis=0) + self.H = triu.cumsum(axis=1) + + def _get_Q_vals(self, start: int, end: int) -> NDArray: + '''Computes matrices A, B, C where all possible values of function `Q` are + given by matrix Q = A - B - C. + See more details about `Q` function in `get_candidate_change_point` method. + + For any given `start` and `end` let + Q(τ, κ) = A(start, τ, κ) - B(start, τ, κ) - C(start, τ, κ), + where `start < τ < κ <= end + 1`. (For definitions of `A`, `B`, `C` see + formula of `Q` in `get_candidate_change_point` method.) All possible values of function Review Comment: If I follow correctly, the +1 in above end+1 is really just because that's the custom in python arrays, ranges and for loops. (Ie array indexes are from 0 to length-1).
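The numpy slicing conventions these docstrings map onto can be sanity-checked with a small standalone sketch of the `distances`/`V`/`H` precomputation. This is an illustrative sketch following the docstring definitions, not the class internals; the sample `series` values are made up:

```python
import numpy as np

series = np.array([1.0, 5.0, 2.0, 8.0, 3.0])
N = len(series)

# Full pairwise L1 distance matrix; the main diagonal is 0.
distances = np.abs(series[:, None] - series[None, :])

# Strict upper triangle, shifted as in the reviewed code.
triu = np.triu(distances, k=1)[:-1, 1:]

# V[t - 1] = V(0, t) = sum(distances[0:t, t]) for t = 1..N-1
V = triu.sum(axis=0)

# H[t, k - 2] = H(t, k) = sum(distances[t, t:k]), cumulative over k
H = triu.cumsum(axis=1)

# Sanity-check both identities against the docstring definitions.
t = 3
assert np.isclose(V[t - 1], distances[0:t, t].sum())
t, k = 1, 4
assert np.isclose(H[t, k - 2], distances[t, t:k].sum())
```

Note the off-by-one offsets in `V[t - 1]` and `H[t, k - 2]`: they come from dropping the all-zero first column/diagonal, which is exactly the indexing subtlety this review thread is discussing.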
My preference would be that these comments stick to the coordinate system: - series has Length measurements, indexed from 0 to Length-1 - Change points share the same indexes 0 to Length-1, except that 0 is not possible. - If it's possible, I would even stick to that numbering also in all methods we define. - Then only inside methods do you have to do things like numpy.whatever(start, end+1) Does something go terribly wrong if we try the above? I believe then the inequality above has to be start < τ <= κ <= end Does this violently contradict notation in Matteson paper? ########## otava/change_point_divisive/calculator.py: ########## @@ -0,0 +1,189 @@ + def _calculate_distances(self): + '''Precomputes `H` and `V` functions that are used for computation of `Q` function.
+ Note: We precomputed all values of `H(start, τ, κ)` because all of them are needed + for the very first iteration (`start=0` and `end=N-1`).''' + self.distances = np.power(np.abs(self.series[:, None] - self.series[None, :]), self.power) + triu = np.triu(self.distances, k=1)[:-1, 1:] + self.V = triu.sum(axis=0) + self.H = triu.cumsum(axis=1) + + def _get_Q_vals(self, start: int, end: int) -> NDArray: + '''Computes matrices A, B, C where all possible values of function `Q` are + given by matrix Q = A - B - C. + See more details about `Q` function in `get_candidate_change_point` method. + + For any given `start` and `end` let + Q(τ, κ) = A(start, τ, κ) - B(start, τ, κ) - C(start, τ, κ), + where `start < τ < κ <= end + 1`. (For definitions of `A`, `B`, `C` see + formula of `Q` in `get_candidate_change_point` method.) All possible values of function + `A` are given by matrix + + | A(start, start + 1, start + 2) A(start, start + 1, start + 3) … A(start, start + 1, end + 1) | + | 0 A(start, start + 2, start + 3) … A(start, start + 2, end + 1) | + A = | ⋮ ⋮ ⋱ ⋮ | . + | 0 0 … A(start, end, end + 1) | + + Matrices B and C follow the same index-to-value pattern. + + Matrices A, B, and C are used to compute values of function `Q` recursively, without + recomputing the same sums over and over again. The recursive formulas were further + transformed to closed forms, so they can be computed using the numpy cumulative sum + function to take advantage of numpy vectorized operations. (Note that each column + in matrices A, B, C can be represented through np.cumsum). The formulas use + cumulative sums of functions `H` and `V`, whose definitions can be found in + `_calculate_distances` method.
+ + A(start, τ, κ) = 2 / (κ - start) * (Σt=start,τ-1 H(start, t, κ) - Σt=start+1,τ-1 V(start, t, κ)), + + B(start, τ, κ) = 2 * (κ - τ) / (κ - start) / (τ - start - 1) * Σt=start+1,τ-1 V(start, t, κ), + + C(start, τ, κ) = 2 * (τ - start) / (κ - start) / (κ - τ - 1) * Σt=τ,κ-2 H(start, t, κ).''' + if self.V is None or self.H is None: + self._calculate_distances() + + V = self.V[start : end] - self.distances[0 : start, start + 1 : end + 1].sum(axis=0) + H = self.H[start : end, start : end] + + taus = np.arange(start + 1, end + 1)[:, None] + kappas = np.arange(start + 2, end + 2)[None, :] + + A = np.zeros((end - start, end - start)) + A_coefs = 2 / (kappas - start) + A[1:, :] = np.cumsum(V)[:-1, None] + A = A_coefs * np.triu(np.cumsum(H, axis=0) - A, k=0) + + B = np.zeros((end - start, end - start)) + B_num = 2 * (kappas - taus) + B_den = (kappas - start) * (taus - start - 1) + B_mask = np.triu(np.ones_like(B_den, dtype=bool), k=0) + B_out = np.zeros_like(B_den, dtype=float) + B_coefs = np.divide(B_num, B_den, out=B_out, where=B_mask & (B_den != 0)) + B[1:, 1:] = B_coefs[1:, 1:] * np.cumsum(V)[:-1, None] + + C = np.zeros((end - start, end - start)) + C_num = 2 * (taus - start) + C_den = (kappas - start) * (kappas - taus - 1) + C_mask = np.triu(np.ones_like(C_den, dtype=bool), k=1) + C_out = np.zeros_like(C_den, dtype=float) + C_coefs = np.divide(C_num, C_den, out=C_out, where=C_mask & (C_den != 0)) + C[:-1, 1:] = C_coefs[:-1, 1:] * np.flipud(np.cumsum(np.flipud(H[1:, 1:]), axis=0)) + + # Element of matrix `Q_{i, j}` is equal to `Q(τ, κ) = Q(i + 1, j + 2) = QQ(sequence[start : i + 1], sequence[i + 1 : j + 2])`. + # So, critical point is `τ = i + 1`. + return A - B - C + + def get_candidate_change_point(self, interval: slice) -> CandidateChangePoint: + '''For a given `slice(start, stop)` finds a potential critical point in subsequence series[slice], + i.e., from index `start` to `stop - 1` inclusive. For simplicity, we'll use `end = stop - 1`.
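The interval-vs-slice convention being discussed here can be pinned down with a tiny illustration. This is a hedged sketch with made-up values, not the Otava API; it only shows how `slice(start, stop)`, `end = stop - 1`, and candidate indexes in original-series coordinates relate:

```python
# An interval as handed to get_candidate_change_point, as a python slice.
interval = slice(3, 10)        # math interval [3, 10), i.e. indexes 3..9
start, stop = interval.start, interval.stop
end = stop - 1                 # 9, the last index included in the interval

# Candidate indexes live in the ORIGINAL series coordinates:
# start < index <= end, never re-based to 0 for each sub-interval.
valid_candidates = range(start + 1, end + 1)
assert list(valid_candidates) == [4, 5, 6, 7, 8, 9]

# Splitting at a candidate index tau yields two sub-slices that
# together cover the interval exactly once.
tau = 6
left, right = slice(start, tau), slice(tau, stop)
assert (left.stop - left.start) + (right.stop - right.start) == stop - start
```

The `start < index <= end` bound is the same one the review suggests spelling out in the docstrings.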
Review Comment: To be super clear, maybe use "interval" and ,) if you are talking math, and slice with [] and range with () in python context ########## otava/analysis.py: ########## @@ -170,10 +126,42 @@ def compare(self, left: np.ndarray, right: np.ndarray) -> ComparativeStats: ) else: p = 1.0 - return ComparativeStats(mean_l, mean_r, std_l, std_r, p) + return TTestStats(mean_1=mean_l, mean_2=mean_r, std_1=std_l, std_2=std_r, pvalue=p) + + def change_point( + self, candidate: CandidateChangePoint, series: Sequence[SupportsFloat], intervals: List[slice] + ) -> ChangePoint[TTestStats]: + """ + Computes properties of the change point from the Candidate Change Point, based on the provided intervals. + + The method works for two cases: + 1. Divisive algorithm. + If the candidate is a new potential change point, i.e., its index is inside any interval, then + we split the interval by the candidate's index to get left and right subseries. + 2. Merge step in t-test algorithm. Review Comment: Oh, maybe we haven't been introduced properly. I'm the one who is serious about naming things. David is the one who knows all about cache invalidation. ########## otava/change_point_divisive/calculator.py: ########## @@ -0,0 +1,189 @@ + # Element of matrix `Q_{i, j}` is equal to `Q(τ, κ) = Q(i + 1, j + 2) = QQ(sequence[start : i + 1], sequence[i + 1 : j + 2])`. + # So, critical point is `τ = i + 1`. Review Comment: The fact that you end up with parameters i+1, j+2 here, rather than the more common i, j+1, suggests to me you need to shift your indexing to the left one step. (0,T) or )0,T) not (1,T+1) ########## otava/change_point_divisive/calculator.py: ########## @@ -0,0 +1,189 @@ + def _get_Q_vals(self, start: int, end: int) -> NDArray: + '''Computes matrices A, B, C where all possible values of function `Q` are + given by matrix Q = A - B - C. The formulas use + cumulative sums of functions `H` and `V`, whose definitions can be found in + `_calculate_distances` method.
+ + A(start, τ, κ) = 2 / (κ - start) * (Σt=start,τ-1 H(start, t, κ) - Σt=start+1,τ-1 V(start, t, κ)), + + B(start, τ, κ) = 2 * (κ - τ) / (κ - start) / (τ - start - 1) * Σt=start+1,τ-1 V(start, t, κ), + + C(start, τ, κ) = 2 * (τ - start) / (κ - start) / (κ - τ - 1) * Σt=τ,κ-2 H(start, t, κ).''' + if self.V is None or self.H is None: + self._calculate_distances() + + V = self.V[start : end] - self.distances[0 : start, start + 1 : end + 1].sum(axis=0) + H = self.H[start : end, start : end] + + taus = np.arange(start + 1, end + 1)[:, None] + kappas = np.arange(start + 2, end + 2)[None, :] + + A = np.zeros((end - start, end - start)) + A_coefs = 2 / (kappas - start) Review Comment: Just for discussion... but it is my opinion that these coefficients are added to the formula only to pass some robustness "almost certainly at the limit" proof. They are never really explained or justified the way every other term is. They have the effect of muting the q values at the ends of an interval, and significantly inflating those in the middle. As a result, given many good candidates q, the algorithm tends to pick change points first in the middle of a series. Anyway, it's for the future, but to me those are fair game once we want to make any mods to this implementation. ########## otava/change_point_divisive/calculator.py: ########## @@ -0,0 +1,189 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License.
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License.
out=C_out, where=C_mask & (C_den != 0)) + C[:-1, 1:] = C_coefs[:-1, 1:] * np.flipud(np.cumsum(np.flipud(H[1:, 1:]), axis=0)) + + # Element of matrix `Q_{i, j}` is equal to `Q(τ, κ) = Q(i + 1, j + 2) = QQ(sequence[start : i + 1], sequence[i + 1 : j + 2])`. + # So, critical point is `τ = i + 1`. Review Comment: If you don't have time to do that in the near future, the shifting of indexes can be a separat followup task too. ########## otava/change_point_divisive/base.py: ########## @@ -0,0 +1,109 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +from dataclasses import dataclass, fields +from typing import Generic, List, Optional, TypeVar + +from numpy.typing import NDArray + + +@dataclass +class CandidateChangePoint: + '''Candidate for a change point. The point that maximizes Q-hat function on [start:end+1] slice''' + index: int + qhat: float + + +@dataclass +class BaseStats: + '''Abstract statistics class for change point. Implementation depends on the statistical test.''' + pvalue: float + + +GenericStats = TypeVar("GenericStats", bound=BaseStats) Review Comment: For those of us not so good at python, can you add a short comment what this does -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]

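As a sanity check of the precomputation described in `_calculate_distances`, here is a small standalone sketch (not part of the patch; the variable names `D`, `rng`, and the loop bounds are mine) that builds the pairwise L1 distance matrix and verifies the vectorized `triu`-based `V` and `H` against their definitions in the docstring, assuming `power=1`:

```python
import numpy as np

rng = np.random.default_rng(0)
series = rng.normal(size=6)
N = len(series)

# Pairwise L1 distances: D[i, j] = |series[i] - series[j]| (zero diagonal)
D = np.abs(series[:, None] - series[None, :])

# Vectorized precomputation, as in the patch (power=1)
triu = np.triu(D, k=1)[:-1, 1:]
V = triu.sum(axis=0)     # V[tau - 1] = V(0, tau) = sum(D[0:tau, tau])
H = triu.cumsum(axis=1)  # H[tau, kappa - 2] = H(tau, kappa) = sum(D[tau, tau:kappa])

# Brute-force check against the docstring definitions
for tau in range(1, N):
    assert np.isclose(V[tau - 1], D[0:tau, tau].sum())
for tau in range(0, N - 1):
    for kappa in range(tau + 2, N + 1):
        assert np.isclose(H[tau, kappa - 2], D[tau, tau:kappa].sum())
```

The index offsets in the comments (`tau - 1`, `kappa - 2`) are exactly the kind of shifts the review comments above suggest documenting explicitly.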
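On the `TypeVar` question: a short illustrative sketch, not from the PR, of what `GenericStats = TypeVar("GenericStats", bound=BaseStats)` does. It declares a type variable that may be substituted by `BaseStats` or any of its subclasses, so `Generic[GenericStats]` containers stay precisely typed. The `TTestStats` subclass and `Detector` class below are hypothetical examples:

```python
from dataclasses import dataclass
from typing import Generic, List, TypeVar


@dataclass
class BaseStats:
    pvalue: float


# Type variable bounded by BaseStats: a type checker only allows
# BaseStats or its subclasses to be substituted for it.
GenericStats = TypeVar("GenericStats", bound=BaseStats)


@dataclass
class TTestStats(BaseStats):  # hypothetical concrete subclass
    statistic: float


class Detector(Generic[GenericStats]):
    """Generic over the statistics type its tests produce."""

    def __init__(self) -> None:
        self.results: List[GenericStats] = []

    def record(self, stats: GenericStats) -> GenericStats:
        self.results.append(stats)
        return stats


d: Detector[TTestStats] = Detector()
s = d.record(TTestStats(pvalue=0.01, statistic=2.5))
# A type checker now knows `s` is TTestStats (with `.statistic`),
# not just BaseStats; Detector[int] would be rejected.
```

In the PR this pattern lets a detector be parametrized by the concrete statistics class its statistical test produces, instead of every caller downcasting from `BaseStats`.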