First post on this mailing list.

I have been working with time series data for a project, and thought I could contribute a new transformer that segments time series data using a sliding window with variable overlap. I have attached a demonstration of how this would fit into the existing framework. The only challenge for me is that the transformer needs to transform both X and y in order to perform the segmentation, and I am not sure from the documentation how to implement this in the framework.
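To make the width / step terminology concrete, here is a minimal sketch of the idea on a single 1-D series. It is not the attached implementation: it assumes NumPy >= 1.20 for sliding_window_view, and segment_1d is a hypothetical helper name used only for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # assumes NumPy >= 1.20


def segment_1d(x, width, step):
    """Hypothetical helper: sliding windows of length `width` over a 1-D series."""
    windows = sliding_window_view(x, width)  # all windows, one time step apart
    return windows[::step]                   # keep every `step`-th window


x = np.arange(6)
overlapping = segment_1d(x, width=3, step=2)  # step < width: neighbours share a sample
disjoint = segment_1d(x, width=3, step=3)     # step == width: no overlap, no gap
print(overlapping)
print(disjoint)
```

With step < width, consecutive windows share width - step samples, so the same series yields more (correlated) training segments.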

Overlapping segments are a great way to boost performance for time series classifiers, so this may be a worthwhile contribution for people working in this area of ML. Ultimately, model_selection.TimeSeriesSplit would need to be modified to support overlapping segments, or a new class created, to enable validation with them.

Please let me know if this would be a worthwhile contribution and, if so, how I should go about transforming the target vector y within the framework / pipeline.

Thanks!

David Burns



from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class TimeSeriesSegment(BaseEstimator, TransformerMixin):
    '''
    Segments time series data with sliding window and variable step size / overlap
    '''

    def __init__(self, width, step=None):
        '''
        :param width: sliding window length (time points), integer > 0
        :param step: number of time steps between windows, integer > 0 (defaults to width)
            if step < width, segments overlap
            if step > width, there is a gap between segments
            if step == width, a sliding window is created with no overlap or gaps
        '''

        # both parameters are documented as integers > 0
        assert width > 0
        self.width = width
        if step is None:
            step = width
        else:
            assert step > 0
        self.step = step

    def fit(self, X, y):
        '''
        :param X: numpy array shape (N,)
            each element is an array shape (Ti, D) corresponding to a time series of variable length Ti and dimension D
            D must be the same for all series
        :param y: target vector numpy array shape (N,)
        '''

        # checking input shape
        assert X.shape[0] == y.shape[0]
        N = X.shape[0]
        shapes = np.array([X[i].shape for i in np.arange(N)])
        assert len(np.unique(shapes[:,1])) == 1

        Xs = []
        ys = []

        for i in np.arange(N):
            Xs.append(self._segment(X[i]))
            ys.append(np.full(Xs[i].shape[0], y[i]))

        self.Xs = np.concatenate(Xs)
        self.ys = np.concatenate(ys)

        return self


    def transform(self, X):
        '''
        :param X: numpy array shape (N,)
            each element is an array shape (Ti, D) corresponding to a time series of variable length Ti and dimension D
            D must be the same for all series
        :return:
            Xs: segmented temporal tensor shape (Nw, width, D)
            ys: target shape (Nw,)
        '''
        ### todo: need to be able to alter ys in the pipeline
        return self.Xs, self.ys

    def _segment(self, X):
        '''
        sliding window segmentation on tensor
        :param X: numpy array shape (Ti, D)
        :return: temporal tensor shape (Nw, width, D)
        '''
        Xs = []
        for j in range(X.shape[1]):
            Xs.append(self._sliding_window(X[:, j]))  # each item has shape (Nw, width); list length D

        return np.stack(Xs, axis=2)

    def _sliding_window(self, x):
        '''
        sliding window segmentation on vector
        :param x: vector numpy array shape (Ti,)
        :return: array shape (Nw, width)
        '''
        # np.hstack expects a sequence of arrays; passing a generator is deprecated
        w = np.hstack([x[i:1 + i - self.width or None:self.step] for i in range(self.width)])
        return w.reshape((len(w) // self.width, self.width), order='F')


def main():
    X = np.array([np.array([[1, 2], [3, 4], [5, 6], [6, 7]]),
                  np.array([[2, 4], [6, 8], [10, 12], [12, 14], [18, 20]])], dtype=object)
    y = np.array([0,1])

    segmenter = TimeSeriesSegment(2,2)
    segmenter = segmenter.fit(X, y)

    Xs, ys = segmenter.transform(X)

    print("Input time series data: \n", X, "\n\n")
    print("Input target vector: \n", y, "\n\n")
    print("Segmented time series: \n ", Xs, "\n\n")
    print("Segmented target vector: \n ", ys, "\n\n")


if __name__ == '__main__':
    main()
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
