Hey Alex,

Thanks for the clarification!

> If the DUT is processing frame-by-frame we cannot expect reliable comparison 
> to other detectors via NAB scoring, although your modification may help. One 
> issue with the mod, however, is if you shift anomaly scores of 1 outside of 
> the window, these become FPs. For example, if the full frame (125 data 
> points) is initially within a window, it's possible your mod changes the 
> score from 1 TP to 1 TP + 62 FPs.

I have to agree with your assessment, especially for smaller windows, where the 
shift will always drag the anomalies outside the ground-truth window. One more 
approach I’d like to test is to record only the end of the frame as the 
detected anomaly. I’ll update you on the results of that as soon as I have 
them. This should eliminate the effect that the ratio of ground-truth window 
size to DUT frame size has on the scoring process.
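For reference, here's a minimal sketch of what I mean (the frame size and the detector's per-frame output format are placeholders, not the real interface):

```python
# Sketch: collapse each frame-level anomaly score onto the frame's last data
# point, so a detection is only ever recorded at the end of the frame.
# FRAME_SIZE and the list-of-frame-scores input are illustrative placeholders.
FRAME_SIZE = 125

def end_of_frame_scores(frame_scores):
    """Expand per-frame scores into per-point scores, where only the
    last point of each frame carries that frame's anomaly score."""
    point_scores = []
    for score in frame_scores:
        point_scores.extend([0.0] * (FRAME_SIZE - 1))  # interior points: no detection
        point_scores.append(score)                     # frame end carries the score
    return point_scores

# e.g. two frames, only the second flagged anomalous:
scores = end_of_frame_scores([0.0, 1.0])
```

That way a frame straddling a ground-truth window contributes at most one detection instead of dozens.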

> Addressing your listed concerns:
> 1. This brings up an important point I should have mentioned earlier; my 
> apologies. The score normalization method [1] assumes there are 44 TPs in the 
> dataset, and also that the Baseline detector has run.

I’m glad you mentioned this; I was unaware the normalization step made those 
assumptions. I assume I’m better off comparing the raw scores instead, correct?

> 2. It is okay for the metrics' counts to vary for different application 
> profiles. For a given DUT, the optimization step calculates the best 
> threshold -- i.e. likelihood value above which a data point is anomalous -- 
> for each application profile, where the best threshold is that which 
> maximizes the score. Thus, consider the application profile "Rewards Low FP 
> Rate". The optimal threshold for this profile will likely be higher than that 
> of the other profiles because then the DUT outputs fewer detections, which 
> likely results in fewer FPs.

So different thresholds per application profile will result in different 
predicted anomalies and therefore different counts of TP, TN, FN, and FP. Got it!
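If I understand the optimization step correctly, it's essentially a sweep like this (where `score_fn` is a stand-in for the profile's NAB scoring function, not the real scorer's API):

```python
# Sketch of the per-profile threshold optimization as I understand it:
# sweep candidate thresholds and keep the one that maximizes that profile's
# score. `score_fn` is a stand-in for the actual NAB scoring function.
def best_threshold(likelihoods, score_fn, candidates):
    """Return (threshold, score) for the candidate maximizing score_fn."""
    best = None
    for t in candidates:
        # a point is anomalous iff its likelihood is at or above the threshold
        detections = [x >= t for x in likelihoods]
        s = score_fn(detections)
        if best is None or s > best[1]:
            best = (t, s)
    return best
```

With a profile that penalizes FPs heavily, a higher threshold wins exactly as you describe, since it yields fewer detections.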

> 3. The issue with your mod I mentioned above, and to a lesser extent the 
> normalization method from (1), may explain the results here. What confusion 
> matrix are you calculating? Is this post-processing you're doing on the 
> results? If so, I'm sure myself and others would be interested in seeing it.

I’m referring to the confusion matrix in the results CSV that the scorer 
produces, i.e. the TP, TN, FN, and FP counts, with precision = TP/(TP+FP) and 
recall = TP/(TP+FN). For example, for the standard profile:

Numenta: 33% precision - 0.5% recall - score: -20
DUT: 17% precision - 78% recall - score: -563

This seems problematic. 
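For clarity, this is the post-processing I'm doing on those counts (the numbers in the example call are made up for illustration, not the actual DUT counts):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); recall = TP/(TP+FN).
    Guard against empty denominators, which can occur when a detector
    makes no detections at all."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Illustrative counts only:
p, r = precision_recall(tp=3, fp=1, fn=2)  # -> (0.75, 0.6)
```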

Best,
Nick
