Hello Carlo,
> 3.2% to 0.17% -> almost 19 times more effective.
>
>

I just checked the files to see how many of them have no body, and this is
the result:

theia trec05p-1 # find data -type f | while read foo ; do if [ $(sed '1,/^$/d' ${foo} | wc -l) -lt 1 ] ; then echo ${foo} ; fi ; done | wc -l
331
theia trec05p-1 # cd full
theia full # cat index | while read foo ; do if [ $(sed '1,/^$/d' ${foo/* /} | wc -l) -lt 1 ] ; then echo ${foo/* /} ; fi ; done | wc -l
331
theia full #

So 331 on the file system and 331 referenced in the index.

I had 155 failures in total, of which 19 are 4MB (my MaxMessageSize) or
bigger:

theia full # find ../data/ -type f -size +4M | wc -l
19
theia full #

To be honest, I think 155 failures is a good value. I don't currently see
much need to lower that number any further. My training script already
handles messages that give no output in summary mode but print out the
whole message if DSPAM is instructed to deliver innocent,spam to stdout.

Just an example with the latest dev version of my training script:

theia full # ../../../../dspam_train_tone_v5 mergedglobal --overleap 20345 --stop-after 10 --refute --max-train 3 --spam-threshold 80 --ham-threshold 40 -i index
Taking Snapshot...
mergedglobal TP: 0 TN: 0 FP: 1 FN: 0 SC: 95517 NC: 54681
====================================================================
Training corpora:
  Using index file: index
Parameters:
  Show subject: No
  Random: No
  Refute: Yes
  Spam TONE Threshold: 0.8
  Ham TONE Threshold: 0.4
  Maximum retrain: 1
  Overleap: 20345
  Stop after: 10
====================================================================
Training on index index...
[test: nonspam] ../data/067/246 result: FAIL [#.##] (probably over MaxMessageSize)
[test: spam   ] ../data/067/247 result: PASS [0.99]
[test: nonspam] ../data/067/248 result: PASS [0.70]
[test: nonspam] ../data/067/249 result: PASS [0.85]
[test: nonspam] ../data/067/250 result: PASS [0.66]
[test: nonspam] ../data/067/251 result: PASS [0.59]
[test: nonspam] ../data/067/252 result: PASS [0.99]
[test: nonspam] ../data/067/253 result: PASS [0.59]
[test: nonspam] ../data/067/254 result: PASS [1.00]
[test: nonspam] ../data/067/255 result: FAIL [#.##] (probably over MaxMessageSize)
TRAINING COMPLETE
====================================================================
Processed: 10 | TP: 1 | TN: 7 | FP: 0 | FN: 0
====================================================================
Training Snapshot:
mergedglobal TP: 0 TN: 0 FP: 0 FN: 0 SC: 0 NC: 0
  SHR: 100.00% HSR: 0.00% OCA: 100.00%
Overall Statistics:
mergedglobal TP: 0 TN: 0 FP: 1 FN: 0 SC: 95517 NC: 54681
  SHR: 100.00% HSR: 100.00% OCA: 0.00%
theia full # ls -lah ../data/067/246
-rw-rw-r-- 1 root root 8.2M Aug 4 2005 ../data/067/246
theia full # ls -lah ../data/067/255
-rw-rw-r-- 1 root root 9.5M Aug 4 2005 ../data/067/255
theia full #

I will, however, check 3.8.0 and see whether it is better than 3.9.0. If it
is, I might go on and enhance 3.9.0 to be on par with 3.8.0. But if 3.9.0 is
better, I am leaving it and will not invest more time fiddling around with
issues that are not real issues. At least no one has complained about it so
far.

> >> Best Regards,
> >> Carlo Rodrigues
> >>
> >>

--
Kind Regards from Switzerland,

Stevan Bajić

_______________________________________________
Dspam-devel mailing list
Dspam-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-devel
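
For anyone who wants to repeat the empty-body count from this message on their own corpus, the following is a minimal sketch of the same idea as a reusable shell function. The header-stripping step (sed '1,/^$/d') comes from the one-liner in the message. The function name count_empty_bodies, the quoting, and the use of read -r are assumptions added for illustration and are not part of the original commands.

```shell
#!/bin/sh
# Sketch only: count the files under a directory whose body is empty.
# "Body" here means everything after the first blank line, as in the
# one-liner in the message. count_empty_bodies is a hypothetical name.
count_empty_bodies() {
    dir=$1
    find "$dir" -type f | while IFS= read -r msg; do
        # Delete line 1 through the first empty line (the header block),
        # then count what is left. tr strips padding some wc versions add.
        body_lines=$(sed '1,/^$/d' "$msg" | wc -l | tr -d ' ')
        if [ "$body_lines" -lt 1 ]; then
            printf '%s\n' "$msg"
        fi
    done | wc -l | tr -d ' '
}
```

Calling count_empty_bodies data from the corpus root should give a number comparable to the first count in the message. Note that a file with no blank line at all is also counted as having no body, because sed then deletes every line.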