Hi,
I ran a very simple test against the NSL-KDD datasets, which can be
downloaded from http://nsl.cs.unb.ca/NSL-KDD/.
KDDTrain+.TXT and KDDTest+.TXT are used as the training and test
datasets respectively, and I added the following header to each of them
to turn them into CSV files:
duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,root_shell,su_attempted,num_root,num_file_creations,num_shells,num_access_files,num_outbound_cmds,is_host_login,is_guest_login,count,srv_count,serror_rate,srv_serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,class
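
For reference, adding the header can be scripted. This is only a minimal Python sketch of that step; the function name prepend_header is made up for illustration, and the file names in the usage comment are the ones from this message:

```python
def prepend_header(src_path, dst_path, header):
    """Copy src_path to dst_path with `header` written as the first line."""
    with open(src_path, "r", newline="") as src, \
         open(dst_path, "w", newline="") as dst:
        # Write the header exactly once, terminated by a single newline.
        dst.write(header.rstrip("\r\n") + "\n")
        # Copy the data rows unchanged, one line at a time.
        for line in src:
            dst.write(line)

# Hypothetical usage, with `header` holding the full header line above:
#   prepend_header("KDDTrain+.TXT", "d:\\train.csv", header)
#   prepend_header("KDDTest+.TXT", "d:\\test.csv", header)
```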

The training and validation commands are as follows:
mahout trainAdaptiveLogistic --input d:\\train.csv --output d:\\model1
--target class --categories 2 --predictors duration protocol_type
service flag src_bytes dst_bytes land wrong_fragment urgent hot
num_failed_logins logged_in num_compromised root_shell su_attempted
num_root num_file_creations num_shells num_access_files
num_outbound_cmds is_host_login is_guest_login count srv_count
serror_rate srv_serror_rate rerror_rate srv_rerror_rate same_srv_rate
diff_srv_rate srv_diff_host_rate dst_host_count dst_host_srv_count
dst_host_same_srv_rate dst_host_diff_srv_rate
dst_host_same_src_port_rate dst_host_srv_diff_host_rate
dst_host_serror_rate dst_host_srv_serror_rate dst_host_rerror_rate
dst_host_srv_rerror_rate --types numeric word word word numeric
numeric word numeric numeric numeric numeric word numeric numeric
numeric numeric numeric numeric numeric numeric word word numeric
numeric numeric numeric numeric numeric numeric numeric numeric
numeric numeric numeric numeric numeric numeric numeric numeric
numeric numeric numeric numeric numeric numeric numeric numeric
numeric numeric numeric numeric  --threads 4 --passes 1 --showperf
--features 500 --skipperfnum 399

mahout validateAdaptiveLogistic --input d:\\test.csv --model
d:\\model1 --auc --confusion --scores

The output of validateAdaptiveLogistic on my system is as follows:


Log-likelihood:Min=-100.00, Max=0.00, Mean=-27.13, Median=0.00

AUC = 0.48

=======================================================
Confusion Matrix
-------------------------------------------------------
a       b       c       <--Classified as
9711    0       0        |  9711        a     = normal
0       12833   0        |  12833       b     = anomaly
0       0       0        |  0           c     = unknown
Default Category: unknown: 2



Entropy Matrix: [[NaN, NaN], [-0.0, -0.0]]

I have a few questions about the output:
#1. According to the confusion matrix, all the records are classified
correctly, but the AUC is only 0.48. I would expect it to be 1.
#2. What does the number after "unknown" mean? Is it the internal code
for the unknown category? Since the confusion matrix is created with a
default category named "unknown", will "unknown" always be shown in the
result, even when no records are unknown, as in this example?
#3. The Entropy Matrix does not seem to be working either; it contains
NaN values.
#4. Since the result is 100% correct, is there something wrong?
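
To illustrate why I expect the AUC in #1 to be 1, here is a minimal Python sketch. It is not Mahout's implementation; the scores, labels, 0.5 threshold, and function names are made up for illustration. It computes the AUC as the fraction of (positive, negative) pairs in which the positive record gets the higher score, and builds a confusion matrix by thresholding the same scores:

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def confusion(scores, labels, threshold=0.5):
    """2x2 matrix indexed as [actual][predicted], predicting 1 when score >= threshold."""
    matrix = [[0, 0], [0, 0]]
    for s, y in zip(scores, labels):
        predicted = 1 if s >= threshold else 0
        matrix[y][predicted] += 1
    return matrix


# Made-up example: 0 = normal, 1 = anomaly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.2, 0.8, 0.9]
print(confusion(scores, labels))  # no off-diagonal entries
print(auc(scores, labels))        # every anomaly outranks every normal record
```

If every record falls on the correct side of a single threshold, every positive score is above every negative score, so the pairwise AUC computed from the same scores has to be 1. An AUC near 0.5 alongside a perfect confusion matrix therefore suggests the two numbers are not being computed from the same scores.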

Regards,

Xiaobo Gu
