Re: [DISCUSS] Expansion of the capabilities of PROFILE_GET

James Sirota Mon, 23 Jan 2017 12:57:48 -0800

I am +1 on this feature.  It opens the door to true statistical baselining.  
The key motivating use case for me is as follows:


Let's say I am looking at the number of flows originating from my server A to 
external assets (anything not on my network) on a Tuesday at 1pm.  I want to 
figure out what number or range of A -> external flows constitutes normal.  I 
would query every bin on a Tuesday at 1pm for the last 5 Tuesdays, figure out 
what the 25/50/75% values are for these bins, and I would know (a) my 'normal' 
range and (b) whether what I have currently is an anomaly.  
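
To make the use case concrete, here is a minimal sketch in Python (not 
Stellar) of the quartile check I have in mind.  The flow counts are made up, 
the function names are hypothetical, and treating the 25%-75% range as 
'normal' is my own assumption rather than anything the Profiler does today.

```python
from statistics import quantiles


def baseline(history):
    """Return the 25/50/75% values of the historical bin counts."""
    q1, median, q3 = quantiles(history, n=4)
    return q1, median, q3


def is_anomaly(current, history):
    """Assumption: 'normal' means inside the 25%-75% range of the history."""
    q1, _, q3 = baseline(history)
    return current < q1 or current > q3


# Made-up A -> external flow counts for the 1pm bin on the last 5 Tuesdays.
tuesday_1pm_flows = [120, 135, 128, 140, 131]

print(baseline(tuesday_1pm_flows))         # (124.0, 131.0, 137.5)
print(is_anomaly(210, tuesday_1pm_flows))  # True
print(is_anomaly(130, tuesday_1pm_flows))  # False
```

In practice the history list would come from something like PROFILE_GET over 
the sparse Tuesday bins, which is exactly what the proposal below enables.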

23.01.2017, 13:01, "Casey Stella" <ceste...@gmail.com>:
> Hi All,
>
> I'm planning to expand the capabilities of PROFILE_GET and wanted to pass
> an idea past the community.
>
> *Current State*
>
> Currently, PROFILE_GET is fairly straightforward. It takes these arguments:
>
>    - profile - The name of the profile.
>    - entity - The name of the entity.
>    - durationAgo - How long ago should values be retrieved from?
>    - units - The units of 'durationAgo'.
>    - groups_list - Optional, must correspond to the 'groupBy' list used in
>    profile creation - List (in square brackets) of groupBy values used to
>    filter the profile. Default is the empty list, meaning groupBy was not used
>    when creating the profile.
>    - config_overrides - Optional - Map (in curly braces) of name:value
>    pairs, each overriding the global config parameter of the same name.
>    Default is the empty Map, meaning no overrides.
>
> This has the advantage of providing a relatively simple mechanism to
> support the dominant use-case, gathering the profiles for a trailing
> window. There are, however, a couple of issues:
>
>    - We may need more complex semantics for specifying the window
>    (motivated below)
>    - As such, this couples the gathering of the profiles with the
>    specification of the window.
>
> I propose to decouple these two concepts. I propose that we extract the
> notion of the lookback into a separate, more featureful function called
> PROFILE_LOOKBACK() which could be composed with an adjusted PROFILE_GET,
> whose arguments look like:
>
>    - profile - The name of the profile.
>    - entity - The name of the entity.
>    - timestamps - The list of timestamps to retrieve
>    - groups_list - Optional, must correspond to the 'groupBy' list used in
>    profile creation - List (in square brackets) of groupBy values used to
>    filter the profile. Default is the empty list, meaning groupBy was not used
>    when creating the profile.
>    - config_overrides - Optional - Map (in curly braces) of name:value
>    pairs, each overriding the global config parameter of the same name.
>    Default is the empty Map, meaning no overrides.
>
> So, PROFILE_GET would have the output of PROFILE_LOOKBACK passed to it as
> its 3rd argument (e.g. PROFILE_GET( 'my_profile', 'my_entity',
> PROFILE_LOOKBACK(...)) ).
>
> *Motivation for Change*
>
> The justification for this is that sometimes you want to compare time bins
> for a long duration back, but you don't want to skew the data by including
> periods that aren't distributionally similar (due to seasonal data, for
> instance). You might want to compare a value to a statistical baseline of
> the median of the values for the same time window on the same day for the
> last month (e.g. every Tuesday at this time).
>
> Also, we might want a trailing window that does not start at the current
> time (in wall-clock), but rather starts an hour back or from the time that
> the data was originally ingested.
>
> *PROFILE_LOOKBACK*
>
> I propose that we support the following features:
>
>    - A starting point that is not current time
>    - Sparse bins (e.g. the last hour for every Tuesday for the last month)
>    - The ability to skip events (e.g. weekends, holidays)
>
> This would result in a new function with the following arguments:
>
>    - from - The lookback starting point (default to now)
>    - fromUnits - The units for the lookback starting point
>    - to - The ending point for the lookback window (default to from + binSize)
>    - toUnits - The units for the lookback ending point
>    - including - A list of conditions which we would include.
>       - weekend
>       - holiday
>       - sunday through saturday
>    - excluding - A list of conditions which we would skip.
>       - weekend
>       - holiday
>       - sunday through saturday
>    - binSize - The size of the lookback bin
>    - binUnits - The units of the lookback bin
>
> Given the number of arguments and their complexity and the fact that many,
> many are optional, I propose that either
>
>    - PROFILE_LOOKBACK take a Map so that we can get essentially named
>    params in Stellar.
>    - PROFILE_LOOKBACK accept a string backed by a DSL to express these
>    criteria
>
> Ok, so that's a lot to take in. How about we look at some motivating
> use-cases.
>
> *Base Case: A lookback of 1 hour ago*
>
> As a map, this would look like:
>
> PROFILE_LOOKBACK( { 'binSize' : 1, 'binUnits' : 'HOURS' } )
>
> As a DSL this would look like:
> PROFILE_LOOKBACK( '1 hour bins from now')
>
> *The same time window every tuesday for the last month starting one hour
> ago*
>
> Just to make this as clear as possible, if this is run at 3PM on Monday
> January 23rd, 2017, it would include the following bins:
>
>    - January 17th, 2PM - 3PM
>    - January 10th, 2PM - 3PM
>    - January 3rd, 2PM - 3PM
>    - December 27th, 2PM - 3PM
>
> As a map, this would look like:
>
> PROFILE_LOOKBACK( { 'from' : 1, 'fromUnits' : 'HOURS', 'to' : 1, 'toUnits'
> : 'MONTH', 'including' : [ 'tuesday' ], 'binSize' : 1, 'binUnits' : 'HOURS'
> } )
>
> As a DSL this would look like:
> PROFILE_LOOKBACK( '1 hour bins from 1 hour to 1 month including tuesdays')
>
> *The same time window every sunday for the last month starting one hour ago
> skipping holidays*
>
> Just to make this as clear as possible, if this is run at 3PM on Monday
> January 22nd, 2017, it would include the following bins:
>
>    - January 16th, 2PM - 3PM
>    - January 9th, 2PM - 3PM
>    - January 2nd, 2PM - 3PM
>    - NOT December 25th
>
> As a map, this would look like:
>
> PROFILE_LOOKBACK( { 'from' : 1, 'fromUnits' : 'HOURS', 'to' : 1, 'toUnits'
> : 'MONTH', 'including' : [ 'sunday' ], 'excluding' : [ 'holiday' ],
> 'binSize' : 1, 'binUnits' : 'HOURS' } )
>
> As a DSL this would look like:
> PROFILE_LOOKBACK( '1 hour bins from 1 hour to 1 month including sundays
> excluding holidays')
>
> *DSL vs API*
>
> So, here's my personal rundown of the two approaches:
>
> DSL:
>
>    - PRO
>       - Clear. As you can see, it reads like a sentence
>       - Concise
>    - CON:
>       - More complex to implement
>       - Another DSL to learn
>
> API:
>
>    - PRO
>       - Simpler to implement (though marginally so, IMO)
>    - CON
>       - A bit more complex to understand (also, IMO)
>
> I'd like to solicit feedback from the community at this point:
>
>    - What do you think of this change?
>    - Would you prefer the DSL, API or other approach?
>
> Thanks,
>
> Casey

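For what it's worth, the Tuesday example quoted above can be sketched outside 
of Stellar to check the bin semantics.  This is only my reading of the 
proposal, written in Python: the function name and arguments are hypothetical, 
'1 month' is approximated as 30 days, and each bin is assumed to start at 
(now - from) on a matching day and to last binSize.

```python
from datetime import datetime, timedelta

# Index matches datetime.weekday(): Monday is 0, Sunday is 6.
DAYS = ["monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"]


def lookback_bins(now, from_offset, to_offset, bin_size, including):
    """Return (start, end) bins, most recent first (hypothetical helper)."""
    bins = []
    start = now - from_offset      # e.g. one hour ago
    earliest = now - to_offset     # e.g. roughly one month ago
    while start >= earliest:
        if DAYS[start.weekday()] in including:
            bins.append((start, start + bin_size))
        start -= timedelta(days=1)  # same time of day, one day earlier
    return bins


# Run at 3PM on Monday January 23rd, 2017, as in the quoted example.
now = datetime(2017, 1, 23, 15, 0)
bins = lookback_bins(now,
                     from_offset=timedelta(hours=1),
                     to_offset=timedelta(days=30),
                     bin_size=timedelta(hours=1),
                     including={"tuesday"})
for start, end in bins:
    print(start.strftime("%Y-%m-%d %H:%M"), "-", end.strftime("%H:%M"))
# 2017-01-17 14:00 - 15:00
# 2017-01-10 14:00 - 15:00
# 2017-01-03 14:00 - 15:00
# 2016-12-27 14:00 - 15:00
```

Under those assumptions the sketch reproduces the four bins listed in the 
quoted example, which suggests the semantics are at least self-consistent.
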
------------------- 
Thank you,

James Sirota
PPMC- Apache Metron (Incubating)
jsirota AT apache DOT org
