Hi,

I have run a very simple test against the KDD datasets, which can be downloaded from http://nsl.cs.unb.ca/NSL-KDD/. KDDTrain+.TXT and KDDTest+.TXT are used as the training and test datasets respectively, and I added the following header line to each of them so that they can be read as CSV files:

duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,root_shell,su_attempted,num_root,num_file_creations,num_shells,num_access_files,num_outbound_cmds,is_host_login,is_guest_login,count,srv_count,serror_rate,srv_serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,class
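For reference, adding the header can be done with a few lines of Python. This is only a sketch of that step: the function name `prepend_header` and the file paths in the usage note are placeholders of mine, and it assumes the data rows already contain exactly one value per header column.

```python
# Header line from this post: 41 predictor columns followed by the target "class".
HEADER = (
    "duration,protocol_type,service,flag,src_bytes,dst_bytes,land,"
    "wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,"
    "root_shell,su_attempted,num_root,num_file_creations,num_shells,"
    "num_access_files,num_outbound_cmds,is_host_login,is_guest_login,count,"
    "srv_count,serror_rate,srv_serror_rate,rerror_rate,srv_rerror_rate,"
    "same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,"
    "dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,"
    "dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,"
    "dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,"
    "dst_host_srv_rerror_rate,class"
)


def prepend_header(src, dst, header=HEADER):
    """Write `header` to `dst`, then copy every non-blank line of `src`.

    Returns the number of data rows copied (hypothetical helper, not part
    of Mahout).
    """
    rows = 0
    with open(src, "r", encoding="utf-8", newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        fout.write(header + "\n")
        for line in fin:
            if not line.strip():
                continue  # skip blank lines so they do not become empty records
            # Make sure the last row is newline-terminated too.
            fout.write(line if line.endswith("\n") else line + "\n")
            rows += 1
    return rows
```

Usage would look like `prepend_header("KDDTrain+.TXT", r"d:\train.csv")`, and likewise for the test file.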
The training and validation commands are as follows:

mahout trainAdaptiveLogistic --input d:\\train.csv --output d:\\model1 --target class --categories 2 --predictors duration protocol_type service flag src_bytes dst_bytes land wrong_fragment urgent hot num_failed_logins logged_in num_compromised root_shell su_attempted num_root num_file_creations num_shells num_access_files num_outbound_cmds is_host_login is_guest_login count srv_count serror_rate srv_serror_rate rerror_rate srv_rerror_rate same_srv_rate diff_srv_rate srv_diff_host_rate dst_host_count dst_host_srv_count dst_host_same_srv_rate dst_host_diff_srv_rate dst_host_same_src_port_rate dst_host_srv_diff_host_rate dst_host_serror_rate dst_host_srv_serror_rate dst_host_rerror_rate dst_host_srv_rerror_rate --types numeric word word word numeric numeric word numeric numeric numeric numeric word numeric numeric numeric numeric numeric numeric numeric numeric word word numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric --threads 4 --passes 1 --showperf --features 500 --skipperfnum 399

mahout validateAdaptiveLogistic --input d:\\test.csv --model d:\\model1 --auc --confusion --scores

The output of validateAdaptiveLogistic on my system is as follows:

Log-likelihood:Min=-100.00, Max=0.00, Mean=-27.13, Median=0.00

AUC = 0.48

=======================================================
Confusion Matrix
-------------------------------------------------------
a       b       c       <--Classified as
9711    0       0        |  9711    a = normal
0       12833   0        |  12833   b = anomaly
0       0       0        |  0       c = unknown
Default Category: unknown: 2

Entropy Matrix: [[NaN, NaN], [-0.0, -0.0]]

I have a few questions about the output:

#1. From the confusion matrix, it seems that all the records are classified correctly, but the AUC is just 0.48. With a perfect classification I would expect it to be 1.
#2. What does the number after "unknown" mean? Is it the internal code for unknown? Since the confusion matrix is created with a default category named unknown, will unknown always be shown in the result, even when no records are unknown, as in this example?

#3. The Entropy Matrix does not seem to be working either.

#4. Since the result is 100% correct, is there something wrong?

Regards,

Xiaobo Gu
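
P.S. To make question #1 concrete, here is a minimal Python sketch of what I understand AUC to measure: the probability that a randomly chosen record of one class gets a higher score than a randomly chosen record of the other class, with ties counted as one half. This is my own illustration and not Mahout's code; the function name `pairwise_auc` and the scores in the usage note are made up for the example.

```python
def pairwise_auc(pos_scores, neg_scores):
    """AUC by direct pairwise comparison (illustrative, O(n*m)).

    Counts the fraction of (positive, negative) pairs in which the positive
    record has the higher score; a tie contributes 0.5.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

With perfectly separated scores, for example `pairwise_auc([0.9, 0.8], [0.1, 0.2])`, this gives 1.0, while scores that are all identical give 0.5. That is why an AUC of 0.48 next to a confusion matrix with no errors looks inconsistent to me.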
