cyrilou242 opened a new issue #6157:
URL: https://github.com/apache/incubator-pinot/issues/6157
## System
ThirdEye built from commit 5f309cff34da802616d92ed361284d95b35ed9b9.
BigQuery as a data source (this should not have any impact here).
## Context
My data ingestion system has a 3-hour delay and a daily granularity,
so I wanted to check how the timeframes (especially the upper limit) are
computed for anomaly detection.
In my case below, the upper date limit seems to be wrong.
### Detection config:
Dataset config (shortened):
```json
{
  "dataset": "dataset",
  "dimensions": ["dimension1"],
  "timeColumn": "date",
  "timeUnit": "DAYS",
  "timeDuration": 1,
  "timeFormat": "yyyyMMdd",
  "timezone": "UTC",
  "dataSource": "SqlThirdEyeDataSource",
  "realtime": false,
  "expectedDelay": {
    "size": 24,
    "unit": "HOURS"
  }
}
```
Detection config:
```yaml
detectionName: 'detection1'
description: '...'
metric: metric1
dataset: dataset
cron: "0 0 18 ? * * *"
dimensionExploration:
  dimensions:
    - dimension1
rules:
- detection:
    - name: 30percentchange
      type: PERCENTAGE_RULE
      params:
        offset: wo1w
        percentageChange: 0.30
        pattern: UP_OR_DOWN
```
## Detection job run
The following queries were run for the detection job, around October 13th,
18:00 UTC:
On 2020-10-13 18:00:23.019159 UTC:
```sql
SELECT MAX(date) FROM dataset
```
I guess the objective of this query is to get the most recent data time in
the db.
Because my db has a 3-hour delay, the result of the query is `2020-10-13`.
On 2020-10-13 18:00:25.735550 UTC:
```sql
SELECT dimension1, COUNT(metric1)
FROM dataset
WHERE UNIX_SECONDS(date) BETWEEN 1602007200 AND 1602612000
GROUP BY dimension1
LIMIT 100000
```
I guess the objective of this query is to get the distinct values of
`dimension1`, because my detection is configured to split by `dimension1`.
The result of the query is [`value1`, `value2`, `value3`].
Notice that here the endTime is 1602612000 = 13 October 2020 18:00:00, which
seems to correspond to the cron trigger time (18:00 UTC daily).
Finally on 2020-10-13 18:00:30.041758 UTC:
```sql
SELECT date, COUNT(metric1)
FROM dataset
WHERE UNIX_SECONDS(date) BETWEEN 1594684800 AND 1602633600
AND metric1 IN ('value1')
GROUP BY date
LIMIT 100000
```
The objective of this query is to get the data for the anomaly detection.
Here the endTime is 1602633600 = 14 October 2020 00:00:00!
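For reference, a quick `java.time` conversion of the epoch seconds seen in
these queries (my own verification snippet, not ThirdEye code):
```java
import java.time.Instant;

public class EpochCheck {
  public static void main(String[] args) {
    // endTime of the dimension query: matches the 18:00 UTC cron trigger
    System.out.println(Instant.ofEpochSecond(1602612000L)); // 2020-10-13T18:00:00Z
    // startTime of the detection query, 92 days before its endTime
    System.out.println(Instant.ofEpochSecond(1594684800L)); // 2020-07-14T00:00:00Z
    // endTime of the detection query: later than the observed MAX(date)
    System.out.println(Instant.ofEpochSecond(1602633600L)); // 2020-10-14T00:00:00Z
  }
}
```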
### Problem
- 14 October is later than the observed max date.
- 13 October would not be correct either: at the time the job runs (18:00
UTC on 13 October), the data for 13 October is not complete, so it does not
make sense to compare it to previous daily data (BETWEEN is inclusive).
--> The correct value I was expecting was 12 October.
This problem (if the implementation is the same) may not be specific to daily
granularity or a date-precision-only column:
- for hourly granularity: computing the 16:00 metric at 16:30 would have the
same problem
- for a datetime column: in my example, most_recent_data_time would have been
2020-10-13 15:00:00, so I guess the same problem would occur
Behavior I would expect:
something like `unix_millis(floorBy(most_recent_data_time, timeUnit)) - 1`,
with `floorBy` a function rounding down to the nearest timeUnit, and `- 1`
because the BETWEEN clause is inclusive (see the sketch after the examples
below).
eg1: most_recent_data_time = 13 October, granularity is Days
--> return 12 October 23:59:59.999
eg2: most_recent_data_time = 13 October 16:30, granularity is Hours
--> return 13 October 15:59:59.999
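As a sketch of what I mean, in Java with `java.time` (the class and the
`inclusiveEndMillis` helper are names I made up for illustration, not
existing ThirdEye code):
```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

public class ExpectedEndTime {

  // Hypothetical helper: floor the most recent data time to the dataset
  // granularity, then subtract 1 ms because the BETWEEN clause is inclusive.
  static long inclusiveEndMillis(Instant mostRecentDataTime, ChronoUnit granularity) {
    return mostRecentDataTime.atZone(ZoneOffset.UTC)
        .truncatedTo(granularity)
        .toInstant()
        .toEpochMilli() - 1;
  }

  public static void main(String[] args) {
    // eg1: daily granularity -> prints 2020-10-12T23:59:59.999Z
    long eg1 = inclusiveEndMillis(Instant.parse("2020-10-13T00:00:00Z"), ChronoUnit.DAYS);
    System.out.println(Instant.ofEpochMilli(eg1));

    // eg2: hourly granularity -> prints 2020-10-13T15:59:59.999Z
    long eg2 = inclusiveEndMillis(Instant.parse("2020-10-13T16:30:00Z"), ChronoUnit.HOURS);
    System.out.println(Instant.ofEpochMilli(eg2));
  }
}
```
Both calls reproduce eg1 and eg2 above.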
Did I miss something in the config?