I could not agree more with Terry.


However, in my view, the situation is not that bleak! Some other openly 
available network datasets are:



1. CAIDA backscatter dataset (contains reflected suspicious traffic)

2. LBNL/ICSI enterprise router dataset (contains segregated scan and benign 
traffic)

3. DEFCON 8-10 CTF datasets (contain only attack traffic during DEFCON 
competition)

4. UMASS gateway link dataset (is manually labeled by Yu Gu at University of 
Massachusetts)

5. Endpoint worm dataset (both benign and worm traffic, logged by argus -- 
probably the only data available at endpoints)



The links to download these datasets are available at 
http://www.nexginrc.org/~zubair/research.htm. 



Labeling network datasets (or establishing "ground truth") can be a tricky 
task. There are two standard ways to create labeled IDS datasets: (1) 
separately collecting benign and malicious traffic and then injecting the 
malicious traffic into the benign traffic to create infected traffic profiles, 
or (2) collecting data and then labeling it via manual inspection or a 
combination of heuristics. 
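
To make the first method a bit more concrete, here is a rough Python sketch of 
the injection step. Everything in it is an assumption for illustration: the 
Flow record and its fields, the inject() helper, and the offset parameter are 
hypothetical and do not match the format of any dataset listed above. It only 
shifts the malicious trace in time and merges it into the benign trace by 
timestamp, keeping the labels.

```python
import heapq
from typing import List, NamedTuple


class Flow(NamedTuple):
    """Hypothetical flow record; the fields are illustrative only."""
    ts: float   # flow start time in seconds
    src: str    # source address
    dst: str    # destination address
    label: str  # "benign" or "malicious"


def inject(benign: List[Flow], malicious: List[Flow], offset: float) -> List[Flow]:
    """Merge a malicious trace into a benign trace, ordered by timestamp.

    The malicious trace is shifted so that its first flow starts `offset`
    seconds after the first benign flow. Labels are carried over unchanged.
    """
    benign = sorted(benign, key=lambda f: f.ts)
    malicious = sorted(malicious, key=lambda f: f.ts)
    if not benign or not malicious:
        return benign + malicious

    # Time shift that places the first malicious flow at benign start + offset.
    shift = benign[0].ts + offset - malicious[0].ts
    shifted = [m._replace(ts=m.ts + shift) for m in malicious]

    # Both inputs are sorted, so a two-way merge keeps the result sorted.
    return list(heapq.merge(benign, shifted, key=lambda f: f.ts))


benign = [
    Flow(100.0, "10.0.0.1", "10.0.0.2", "benign"),
    Flow(110.0, "10.0.0.3", "10.0.0.2", "benign"),
    Flow(120.0, "10.0.0.1", "10.0.0.4", "benign"),
]
malicious = [
    Flow(5.0, "192.0.2.9", "10.0.0.2", "malicious"),
    Flow(7.0, "192.0.2.9", "10.0.0.3", "malicious"),
]

mixed = inject(benign, malicious, offset=5.0)
for flow in mixed:
    print(flow.ts, flow.src, flow.dst, flow.label)
```

A naive merge like this ignores address remapping, rate matching, and 
differences between the two capture environments, so treat it as a starting 
point rather than a recipe.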



The first method has been used in a number of papers published at SIGCOMM 
and S&P (for example, the "Mining Anomalies" paper by Lakhina). However, some 
reviews that I received from S&P 2008 indicate that this method is no longer 
trusted, because the injection can introduce unwanted artifacts. 



The second method can be laborious and is prone to errors. However, I have been 
working on some semi-automated procedures to label anomalies in network 
traffic. Let me know if you have any ideas in this regard. 


