I am +1 on this feature. It opens the door to true statistical baselining. The key motivating use case for me is as follows:
Let's say I am looking at the number of flows originating from my server A to external assets (anything not on my network) on a Tuesday at 1pm. I want to figure out what number or range of A -> external flows constitutes normal. I would query the 1pm bin for each of the last 5 Tuesdays, figure out what the 25th/50th/75th percentile values are for these bins, and I would then know (a) my 'normal' range and (b) whether what I have currently is an anomaly.

23.01.2017, 13:01, "Casey Stella" <ceste...@gmail.com>:
> Hi All,
>
> I'm planning to expand the capabilities of PROFILE_GET and wanted to pass
> an idea past the community.
>
> *Current State*
>
> Currently, the functionality of PROFILE_GET is fairly straightforward:
>
> - profile - The name of the profile.
> - entity - The name of the entity.
> - durationAgo - How long ago should values be retrieved from?
> - units - The units of 'durationAgo'.
> - groups_list - Optional, must correspond to the 'groupBy' list used in
>   profile creation - List (in square brackets) of groupBy values used to
>   filter the profile. Default is the empty list, meaning groupBy was not used
>   when creating the profile.
> - config_overrides - Optional - Map (in curly braces) of name:value
>   pairs, each overriding the global config parameter of the same name.
>   Default is the empty Map, meaning no overrides.
>
> This has the advantage of providing a relatively simple mechanism to
> support the dominant use-case, gathering the profiles for a trailing
> window. The issues, however, are a couple:
>
> - We may need more complex semantics for specifying the window
>   (motivated below)
> - As such, this couples the gathering of the profiles with the
>   specification of the window.
>
> I propose to decouple these two concepts. I propose that we extract the
> notion of the lookback into a separate, more featureful function called
> PROFILE_LOOKBACK() which could be composed with an adjusted PROFILE_GET,
> whose arguments look like:
>
> - profile - The name of the profile.
> - entity - The name of the entity.
> - timestamps - The list of timestamps to retrieve
> - groups_list - Optional, must correspond to the 'groupBy' list used in
>   profile creation - List (in square brackets) of groupBy values used to
>   filter the profile. Default is the empty list, meaning groupBy was not used
>   when creating the profile.
> - config_overrides - Optional - Map (in curly braces) of name:value
>   pairs, each overriding the global config parameter of the same name.
>   Default is the empty Map, meaning no overrides.
>
> So, PROFILE_GET would have the output of PROFILE_LOOKBACK passed to it as
> its 3rd argument (e.g. PROFILE_GET( 'my_profile', 'my_entity',
> PROFILE_LOOKBACK(...)) ).
>
> *Motivation for Change*
>
> The justification for this is that sometimes you want to compare time bins
> for a long duration back, but you don't want to skew the data by including
> periods that aren't distributionally similar (due to seasonal data, for
> instance). You might want to compare a value to a statistical baseline of
> the median of the values for the same time window on the same day for the
> last month (e.g. every Tuesday at this time).
>
> Also, we might want a trailing window that does not start at the current
> time (in wall-clock), but rather starts an hour back or from the time that
> the data was originally ingested.
>
> *PROFILE_LOOKBACK*
>
> I propose that we support the following features:
>
> - A starting point that is not current time
> - Sparse bins (i.e. the last hour for every Tuesday for the last month)
> - The ability to skip events (e.g.
>   weekends, holidays)
>
> This would result in a new function with the following arguments:
>
> - from - The lookback starting point (default to now)
> - fromUnits - The units for the lookback starting point
> - to - The ending point for the lookback window (default to from + binSize)
> - toUnits - The units for the lookback ending point
> - including - A list of conditions which we would include.
>   - weekend
>   - holiday
>   - sunday through saturday
> - excluding - A list of conditions which we would skip.
>   - weekend
>   - holiday
>   - sunday through saturday
> - binSize - The size of the lookback bin
> - binUnits - The units of the lookback bin
>
> Given the number of arguments and their complexity and the fact that many,
> many are optional, I propose that either
>
> - PROFILE_LOOKBACK take a Map so that we can get essentially named
>   params in Stellar.
> - PROFILE_LOOKBACK accept a string backed by a DSL to express these
>   criteria
>
> Ok, so that's a lot to take in. How about we look at some motivating
> use-cases.
>
> *Base Case: A lookback of 1 hour ago*
>
> As a map, this would look like:
>
> PROFILE_LOOKBACK( { 'binSize' : 1, 'binUnits' : 'HOURS' } )
>
> As a DSL this would look like:
> PROFILE_LOOKBACK( '1 hour bins from now')
>
> *The same time window every tuesday for the last month starting one hour
> ago*
>
> Just to make this as clear as possible, if this is run at 3PM on Monday
> January 23rd, 2017, it would include the following bins:
>
> - January 17th, 2PM - 3PM
> - January 10th, 2PM - 3PM
> - January 3rd, 2PM - 3PM
> - December 27th, 2PM - 3PM
>
> As a map, this would look like:
>
> PROFILE_LOOKBACK( { 'from' : 1, 'fromUnits' : 'HOURS', 'to' : 1, 'toUnits'
> : 'MONTH', 'including' : [ 'tuesday' ], 'binSize' : 1, 'binUnits' : 'HOURS'
> } )
>
> As a DSL this would look like:
> PROFILE_LOOKBACK( '1 hour bins from 1 hour to 1 month including tuesdays')
>
> *The same time window every sunday for the last month starting one hour ago
> skipping holidays*
>
> Just to make this as clear as possible, if this is run at 3PM on Monday
> January 22nd, 2017, it would include the following bins:
>
> - January 16th, 2PM - 3PM
> - January 9th, 2PM - 3PM
> - January 2nd, 2PM - 3PM
> - NOT December 25th
>
> As a map, this would look like:
>
> PROFILE_LOOKBACK( { 'from' : 1, 'fromUnits' : 'HOURS', 'to' : 1, 'toUnits'
> : 'MONTH', 'including' : [ 'tuesday'], 'excluding' : [ 'holidays' ],
> 'binSize' : 1, 'binUnits' : 'HOURS' } )
>
> As a DSL this would look like:
> PROFILE_LOOKBACK( '1 hour bins from 1 hour to 1 month including tuesdays
> excluding holidays')
>
> *DSL vs API*
>
> So, here's my personal rundown of the two approaches:
>
> DSL:
>
> - PRO
>   - Clear.
>     As you can see, it reads like a sentence
>   - Concise
> - CON:
>   - More complex to implement
>   - Another DSL to learn
>
> API:
>
> - PRO
>   - Simpler to implement (though marginally so, IMO)
> - CON
>   - A bit more complex to understand (also, IMO)
>
> I'd like to solicit feedback from the community at this point:
>
> - What do you think of this change?
> - Would you prefer the DSL, API or other approach?
>
> Thanks,
>
> Casey

-------------------
Thank you,

James Sirota
PPMC- Apache Metron (Incubating)
jsirota AT apache DOT org
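
P.S. To make the baselining use case at the top of this message a bit more concrete, here is a rough sketch in Python rather than Stellar. It is only an illustration, not how the Profiler would implement it. The names weekly_bins and is_anomaly are hypothetical, the flow counts in the usage note are made up, and treating the 25th-75th percentile range as 'normal' is my own simplifying assumption.

```python
from datetime import datetime, timedelta
from statistics import quantiles


def weekly_bins(now, weekday, count,
                offset=timedelta(hours=1), bin_size=timedelta(hours=1)):
    """Return `count` (start, end) bins, newest first, one per past `weekday`.

    Hypothetical helper, not part of Metron. `weekday` follows
    datetime.weekday(): Monday is 0, Tuesday is 1, ..., Sunday is 6.
    Assumption: only calendar days before the day of `now - offset` count,
    so a bin on the current day is never returned.
    """
    start = now - offset
    # Walk back one day at a time until we land on the requested weekday.
    day = start - timedelta(days=1)
    while day.weekday() != weekday:
        day -= timedelta(days=1)
    # Step back a week at a time, keeping the same time-of-day window.
    return [(day - timedelta(weeks=i), day - timedelta(weeks=i) + bin_size)
            for i in range(count)]


def is_anomaly(current, history):
    """True if `current` falls outside the 25th-75th percentile of `history`.

    Assumption: 'normal' means the interquartile range of the past bins.
    `history` needs at least two values for quantiles() to work.
    """
    q1, _median, q3 = quantiles(history, n=4)
    return not (q1 <= current <= q3)
```

Calling weekly_bins(datetime(2017, 1, 23, 15, 0), weekday=1, count=4) gives the 2PM - 3PM bins for January 17th, 10th, 3rd and December 27th, which matches Casey's Tuesday example. With made-up flow counts of [100, 120, 110, 130, 90] for five past bins, the quartiles come out to 95, 110 and 125, so a current count of 400 would be flagged and 105 would not.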