An ideal IDS dataset would be fully diverse (in terms of types of attacks) and 
completely free of artifacts (incurred during creation and pre-processing). 
However, ideal scenarios rarely hold in real life -- and if they did, the data 
would hardly be realistic.



I agree that it is very hard to obtain datasets with payloads due to privacy 
constraints. Good anonymization procedures, however, mostly preserve the 
relative statistics of the data. For details, you may consult the following 
work by people at ICSI.



http://www.icir.org/enterprise-tracing/devil-ccr-jan06.pdf
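To make the "relative statistics are preserved" point concrete, here is a 
minimal sketch (my own illustration, not the scheme from the ICSI paper): a 
keyed, deterministic pseudonymization of IP addresses. The key name and the 
toy trace below are made up, but the property holds for any deterministic 
mapping -- per-host counts survive anonymization unchanged.

```python
import hmac, hashlib
from collections import Counter

SECRET_KEY = b"site-local-secret"  # hypothetical key, held by the data owner

def anonymize_ip(ip: str) -> str:
    """Keyed, deterministic pseudonymization: the same real IP always maps
    to the same pseudonym, so per-host statistics are preserved."""
    digest = hmac.new(SECRET_KEY, ip.encode(), hashlib.sha256).digest()
    # Render the first 4 digest bytes as a fake dotted quad.
    return ".".join(str(b) for b in digest[:4])

# Toy trace: the per-source packet-count distribution is unchanged.
trace = ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1", "192.168.1.5"]
real_counts = Counter(trace)
anon_counts = Counter(anonymize_ip(ip) for ip in trace)
assert sorted(real_counts.values()) == sorted(anon_counts.values())
```

Of course, a serious procedure (e.g. prefix-preserving anonymization, as 
discussed in the paper above) has to defend against re-identification attacks 
too; this sketch only illustrates why statistics-based IDS evaluation can 
still work on anonymized traces.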



An overwhelming majority of network-based IDSs use only the spatial 
information present in packet headers. The datasets that I mentioned in my 
earlier post can be used to evaluate such IDSs. Moreover, you can find details 
of the endpoint worm propagation dataset in the following papers:



http://www.nexginrc.org/papers/tr15-zubair.pdf

http://www.nexginrc.org/papers/gecco08-zubair.pdf
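To illustrate what "spatial information in packet headers" means in practice, 
here is a small sketch of per-source features built from header fields alone 
(no payload needed). The field names and the toy packets are illustrative 
assumptions, not taken from any particular dataset:

```python
from collections import defaultdict

# Hypothetical parsed packet headers -- only fields visible to a
# header-only IDS (addresses, ports, protocol, length).
packets = [
    {"src": "10.0.0.1", "dst": "10.0.0.9", "dport": 80,  "proto": "tcp", "len": 1500},
    {"src": "10.0.0.1", "dst": "10.0.0.9", "dport": 22,  "proto": "tcp", "len": 60},
    {"src": "10.0.0.1", "dst": "10.0.0.9", "dport": 443, "proto": "tcp", "len": 60},
]

def header_features(pkts):
    """Per-source feature vector from header fields only:
    distinct destination ports, distinct destinations, mean packet size."""
    by_src = defaultdict(list)
    for p in pkts:
        by_src[p["src"]].append(p)
    feats = {}
    for src, ps in by_src.items():
        feats[src] = {
            "n_dports": len({p["dport"] for p in ps}),
            "n_dsts": len({p["dst"] for p in ps}),
            "mean_len": sum(p["len"] for p in ps) / len(ps),
        }
    return feats

print(header_features(packets))
# → {'10.0.0.1': {'n_dports': 3, 'n_dsts': 1, 'mean_len': 540.0}}
```

A source hitting many distinct ports on one host with small packets (as 
above) is the classic header-level signature of a port scan -- exactly the 
kind of behavior these payload-free datasets can still exercise.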



In my view, there are two directions to take dataset labeling further:



1. Improving injection procedures to minimize artifacts. This is more feasible 
if you know all parameters and environmental conditions during trace 
collection -- Know Thy Data.



2. Using "semi-automated" (equivalently, "semi-manual") labeling procedures. 



@Stefano: You have probably missed this point. Semi-automated procedures still 
require manual intervention; however, they reduce its magnitude significantly. 
So we are not exactly developing a typical anomaly detection system.



Let me know what you think.

