That was a lot to digest so I apologize if I have missed some of your thought process. I probably need to read through this a couple more times.
Can anyone specify a better way to do Sliding Window profiles correctly > with current functionality? Could we not change the underlying implementation of the profiler to support sliding windows? Ideally the same profile definition could be applied to either a tumbling or sliding window, rather than requiring a change in the profile definition itself. The work that I had done for METRON-590, made it possible to configure either a tumbling or sliding window. The downside of that work is that it was an all-or-nothing change, all profiles (in the same topology) were either tumbling or sliding. But perhaps we can enhance it from there. A much more elegant solution would be to allow “result” to write two > objects. Yes, I like this! I have had the same thought myself. This would be a simple change and would provide the user with some flexibility. Does this change solve the entire problem though? Obviously, I do need to sit down and think on this a bit more. I really wish we could handle these use cases with a profile definition that remains simple and easy to grok. Thanks for spending the time to think through this. On Wed, Jan 25, 2017 at 2:00 AM, Matt Foley <ma...@apache.org> wrote: > Hi all, > > Casey and I had an interesting chat yesterday, during which we agreed that > the example code for Outlier Analysis in https://github.com/apache/ > incubator-metron/blob/master/metron-analytics/metron-statistics/README.md > and the revised example code in https://issues.apache.org/ > jira/browse/METRON-668 (as of 23 January) both do not correctly implement > the desired Sliding Window model. This email gives the argument for why, > and proposes a couple ways to do it right. Your input and preferences are > requested. > > > > First a couple statements about the STATS object that underlies most > interesting Profile work: > > · The STATS object is a t-digest. It summarizes a set of data > points, such as those received during a sampling period, in a way that is > nominally O(1) regardless of the input number of data points, and preserves > the info about the “shape” of the statistical distribution of those > points. Not only info about averages and standard deviations, but also > about medians and percentiles (which, btw, is a very different kind of > information), is preserved and can be calculated correctly to within a > given error epsilon. Since it is a summary, however, time information is > lost. > > · STATS objects, these digests of sampling periods, are MERGEABLE, > meaning if you have a digest from time(1) to time(2), and another digest > from time(2) to time (3), you can merge them and get a digest that is > statistically equivalent to a digest from time(1) to time(3) continuously. > > · They are sort of idempotent, in that if you take two identical > digests and merge them, you get almost the same object. However, the > result object will be scaled as summarizing twice the number of input data > points. > > · Which is why it DOESN’T work to try to merge overlapping > sampling periods. To give a crude example, if you have a digest from > time(1) to time(3) and another digest from time(2) to time(4), and merge > them, the samples from time(2) to time(3) will be over-represented by a > factor of 2x, which should be expected to skew the distribution (unless the > distribution really is constant for all sub-windows – which would mean we > don’t need Sliding Windows because nothing changes). > > > > The Outlier Analysis profiles linked above try to implement a sliding > window, in which each profile period summarizes the Median Absolute > Deviation distribution of the last five profile periods only. An “Outlier > MAD Score” can then be determined by comparing the deviation of a new data > point to the last MAD distribution recorded in the Profile. This allows > for changes over time in the statistical distribution of inputs, but does > not make the measure unduly sensitive to just the last minute or two. This > is a typical use case for Sliding Windows. > > > > Both example codes trip on how to do the sliding window in the context of > a Profile. At sampling period boundaries, both either compose the “result” > digest or initialize the “next” digest by reading the previous 5 result > digests. That is wrong, because it ignores the fact that those digests > aren’t just for their time periods. They too were composed with THEIR > preceding 5 digests, each of which were composed with their preceding 5 > digests, which in turn… etc. The end result is sort of like the way > Madieras or some brandies are aged via the Solera process with fractional > blending. You don’t get a true sliding window, which sharply drops the > past, you get a continuous dilution of the past. In fact, it’s wrong to > assume that the “tail” of the far past is more diluted than the near past! > – It would be with some algorithms, but the alg used in these two examples > causes the far past to become an exponentially MORE important fraction of > the overall data than the near past – much worse than simply turning on > digesting at time(0) and leaving it on, with no attempt at windowing. > (Simulate it in a spreadsheet and you’ll see.) > > > > We need a Profiler structure that assists in creating Sliding Window > profiles. The problem is that Profiles let you save only one thing > (quantity or object) per sampling period, and that’s typically a different > “thing” (object type or scale) than you want to use to compose the result > for each windowed span. One way to do it correctly would be with two > Profiles, like this: > > > > (SOLUTION A) > > { > > "profiles": [ > > { > > "profile": "sketchy_mad", > > "foreach": "'global'", > > "init" : { > > "s": "OUTLIER_MAD_INIT()" > > }, > > "update": { > > "s": "OUTLIER_MAD_ADD(s, value)" > > }, > > "result": "s" > > }, > > { > > "profile": "windowed_mad", > > "foreach": "'global'", > > "init" : { }, > > "update": { }, > > "result": "OUTLIER_MAD_STATE_MERGE(PROFILE_GET('sketchy_mad', > > 'global', 5, 'MINUTES'))" > > } > > ] > > } > > > > This is typical. You have a fine-grain sampling period that you want to > “tumble”, and a broader window that you want to “slide” or “roll” along the > underlying smaller sampling periods. > > The above pair of profiles is clear enough, but not real efficient. Also, > you can’t guarantee they’ll stay exactly synchronized. It would be better > to combine them somehow. > > > > In these examples, we only need to look back 5 sampling periods, so the > following would work (NOTE: requires Casey’s PR#668 that preserves state > between sampling periods): > > > > (SOLUTION B) > > { > > "profiles": [ > > { > > "profile": "windowed_mad", > > "foreach": "'global'", > > "init" : { > > "s4": "if exists(s3) then s3 else OUTLIER_MAD_INIT()", > > "s3": "if exists(s2) then s2 else OUTLIER_MAD_INIT()", > > "s2": "if exists(s1) then s1 else OUTLIER_MAD_INIT()", > > "s1": "if exists(s0) then s0 else OUTLIER_MAD_INIT()", > > "s0": "OUTLIER_MAD_INIT()" > > }, > > "update": { > > "s0": "OUTLIER_MAD_ADD(s0, value)" > > }, > > "result": "OUTLIER_MAD_STATE_MERGE(s0, s1, s2, s3, s4)” > > } > > ] > > } > > > > This works fine for a sliding window of size 5 sample periods. But what > if it was 15? Or 60? > > No doubt we could implement a ring buffer with a stellar list object, but > it’s gonna get ugly(er) and slow. > > > > A much more elegant solution would be to allow “result” to write two > objects. The result could be viewed either as two Profiles, or as a > Profile with multiple addressable columns of hbase output. For instance: > > (SOLUTION C) > > { > > "profiles": [ > > { > > "profile": "[ 'sketchy_mad', 'windowed_mad' ]", > > "foreach": "'global'", > > "init" : { > > "s": "OUTLIER_MAD_INIT()" > > }, > > "update": { > > "s": "OUTLIER_MAD_ADD(s, value)" > > }, > > "result": { > > "sketchy_mad": "s", > > "windowed_mad": "OUTLIER_MAD_STATE_MERGE(s, > PROFILE_GET('sketchy_mad', > > 'global', 4, 'MINUTES'))" > > } > > } > > ] > > } > > > > This is clean and tight, and I think will be easily generalized to any > Sliding Window scenario. > > BTW, given that each write has to work it’s way through more processing > before getting to HBase, it’s best to say that all writes happen AFTER all > the “result” processing, which is why I showed the MERGE above retrieving > only 4 sampling periods prior to the current one, and then add the current > one from local state. > > > > What do you think? Can anyone specify a better way to do Sliding Window > profiles correctly with current functionality? > > > > > > Casey noted that if this feature were available, he would like to consider > separating the value distribution from the deviation distribution, in the > OUTLIER_MAD state objects. He also notes that the hypothetical > OUTLIER_MAD_INIT() function needs to take a starting median, as argument. > So the final Sliding Window Profile for OUTLIER_MAD state tracking might > look more like: > > > > { > > "profiles": [ > > { > > "profile": "[ 'sketchy_values', 'sketchy_mad', 'windowed_mad' ]", > > "foreach": "'global'", > > "init" : { > > "median": "STATS_PERCENTILE(STATS_MERGE(PROFILE_GET('sketchy_values', > 'global', 5, 'MINUTES')), 50)", > > "s": "OUTLIER_MAD_INIT(median)", > > "val_dist": "STATS_INIT()" > > }, > > "update": { > > "val_dist": "STATS_ADD(val_dist, value)", > > "s": "OUTLIER_MAD_ADD(s, value)" > > }, > > "result": { > > "sketchy_values": "val_dist", > > "sketchy_mad": "s", > > "windowed_mad": "OUTLIER_MAD_STATE_MERGE(s, > PROFILE_GET('sketchy_mad', > > 'global', 4, 'MINUTES'))" > > } > > }, > > ] > > } > > > > Thanks, > > --Matt > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- Nick Allen <n...@nickallen.org>