Hey Alex,

Thanks for the clarification!
> If the DUT is processing frame-by-frame we cannot expect reliable comparison
> to other detectors via NAB scoring, although your modification may help. One
> issue with the mod, however, is if you shift anomaly scores of 1 outside of
> the window, these become FPs. For example, if the full frame (125 data
> points) is initially within a window, it's possible your mod changes the
> score from 1 TP to 1 TP + 62 FPs.

I have to agree with your assessment, especially for smaller windows, where the shift will always drag the anomalies outside the ground-truth window. One more approach I'd like to test is to record only the end of the frame as the detected anomaly; I'll update you on the results as soon as I have them. This should eliminate the effect that the ratio of ground-truth window size to DUT frame size has on the scoring process.

> Addressing your listed concerns:
> 1. This brings up an important point I should have mentioned earlier; my
> apologies. The score normalization method [1] assumes there are 44 TPs in the
> dataset, and also that the Baseline detector has run.

I'm glad you mentioned this; I was unaware the normalization step made those assumptions. I assume I'm better off comparing the raw scores instead, correct?

> 2. It is okay for the metrics' counts to vary for different application
> profiles. For a given DUT, the optimization step calculates the best
> threshold -- i.e. likelihood value above which a data point is anomalous --
> for each application profile, where the best threshold is that which
> maximizes the score. Thus, consider the application profile "Rewards Low FP
> Rate". The optimal threshold for this profile will likely be higher than that
> of the other profiles because then the DUT outputs fewer detections, which
> likely results in fewer FPs.

So different thresholds for different application profiles will result in different predicted anomalies, and therefore different counts of TP, TN, FN, and FP. Got it!

> 3.
> The issue with your mod I mentioned above, and to a lesser extent the
> normalization method from (1), may explain the results here. What confusion
> matrix are you calculating? Is this post-processing you're doing on the
> results? If so, I'm sure myself and others would be interested in seeing it.

I'm referring to the confusion matrix in the results CSV that the scorer produces, i.e. the TP, TN, FN, and FP counts, with precision being TP/(TP+FP) and recall being TP/(TP+FN). For example, for the standard profile:

Numenta: 33% precision, 0.5% recall, score: -20
DUT: 17% precision, 78% recall, score: -563

This seems problematic.

Best,
Nick
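P.S. In case it helps the discussion, here is a minimal sketch of the frame-end idea I mentioned. It is not NAB code; it assumes a detector that emits one anomaly score per frame, and the names (`frame_end_scores`, `FRAME_SIZE`) are hypothetical. The point is just that each frame's score lands on exactly one timestamp (the frame's last point), so the count of scored points no longer depends on how the frame overlaps a ground-truth window.

```python
# Sketch only: map per-frame anomaly scores onto per-point scores by
# recording each frame's score at the frame's final data point and 0.0
# everywhere else. FRAME_SIZE matches the 125-point example above.

FRAME_SIZE = 125

def frame_end_scores(frame_scores, n_points):
    """Expand per-frame scores to a per-point score list, placing each
    frame's score on that frame's last point only."""
    point_scores = [0.0] * n_points
    for i, score in enumerate(frame_scores):
        end = min((i + 1) * FRAME_SIZE, n_points) - 1  # last point of frame i
        point_scores[end] = score
    return point_scores

# Example: two frames over 250 points; only points 124 and 249 carry scores,
# so a frame fully inside a window contributes at most one detection.
scores = frame_end_scores([1.0, 0.3], 250)
```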
