Kevin Marlis created SDAP-406:
---------------------------------

             Summary: Time series comparison stats issues
                 Key: SDAP-406
                 URL: https://issues.apache.org/jira/browse/SDAP-406
             Project: Apache Science Data Analytics Platform
          Issue Type: Bug
          Components: analysis
            Reporter: Kevin Marlis

*In short*: the time series comparison stats only compute the linear regression over results whose timestamps match exactly across the two datasets. Example: DS1 and DS2 are both monthly products, but DS1 data falls on the first of the month and DS2 falls on the middle of the month. With no matching times across the two datasets, none of the algorithm results get passed to the regression algorithm.

*In detail*: The issue is at this line:
[https://github.com/apache/incubator-sdap-nexus/blob/22b10f661f02e4b8329e3973234b83b188133d8c/analysis/webservice/algorithms_spark/TimeSeriesSpark.py#L314]

{{xy}} is only appended to if there are 2 dictionaries of results in {{item}}. That only happens if the two datasets share an identical time value. In the example above the dates never sync up ({{if len(item) == 2}} is never satisfied), so the xs and ys for the regression never get appended to. The linear regression algorithm will also return nans if the x and y arrays only contain one value, which can be problematic downstream. Empty comparison stats don't appear to affect the charts on the frontend.

*Possible fixes*:
* Check if the linear regression results are nan; if so, set the stats to an empty dict
* Date normalization to make the time steps consistent across multiple datasets

For now we're going with the first option, although the second option could be looked into.
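
A rough sketch of both options, for illustration only. This is not the code in TimeSeriesSpark.py: the function names ({{comparison_stats}}, {{pair_by_month}}), the stats field names, and the (datetime, value) input shape are all assumptions made for this example.

```python
import math
from datetime import datetime


def comparison_stats(regression_result):
    """Option 1: return {} if any regression output is nan or infinite.

    `regression_result` is assumed to be a 5-tuple in the order
    (slope, intercept, r, p, err); these field names are hypothetical.
    """
    names = ("slope", "intercept", "r", "p", "err")
    values = [float(v) for v in regression_result]
    if not all(math.isfinite(v) for v in values):
        return {}
    return dict(zip(names, values))


def pair_by_month(series_a, series_b):
    """Option 2: pair values by (year, month) instead of exact timestamp.

    Each series is assumed to be a list of (datetime, value) tuples with
    at most one value per month.
    """
    b_by_month = {(t.year, t.month): v for t, v in series_b}
    xs, ys = [], []
    for t, v in series_a:
        key = (t.year, t.month)
        if key in b_by_month:
            xs.append(v)
            ys.append(b_by_month[key])
    return xs, ys


# DS1 falls on the first of the month, DS2 on the middle of the month.
ds1 = [(datetime(2020, 1, 1), 1.0), (datetime(2020, 2, 1), 2.0)]
ds2 = [(datetime(2020, 1, 15), 10.0), (datetime(2020, 2, 15), 20.0)]

xs, ys = pair_by_month(ds1, ds2)
nan_stats = comparison_stats((float("nan"),) * 5)
```

With month-level pairing, the two example series line up even though no timestamps match exactly. Normalizing to the month only makes sense for monthly products; other time steps would need a different key.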