>Peter, let's do an example:
>
>I assume: for MaxFiles = 13000 -> 13000 files in each of spam and notspam,
>and ~3000 files in each of the two error locations - 2GB is not enough!
>
>Currently the HMM uses 4 BDBs totalling ~100MB for ~3000 files (in total)
>on my test system - looks OK?
>
>The logic is flat - the HMM uses only sequences of 4 words followed by one
>symbol (word) and counts/stores only these. A real HMM would also
>count/store all the stages below:
>
>3 - 1
>2 - 1
>1 - 1 (1 - 1 is what Bayes does)
>
>I'll explain this in a bit more detail.
>
>For 10 words 0 ..... 9 the following is stored in the model:
>
>0,1,2,3->4
>1,2,3,4->5
>2,3,4,5->6
>3,4,5,6->7
>4,5,6,7->8
>5,6,7,8->9
>
>In math: for n words in a mail we need to store n-4 records (~n) for the
>very simple model.
>
>Now let's take only the first sequence 0,1,2,3->4 in a real HMM - the HMM
>would store:
>
>0->1
>0,1->2
>0,1,2->3
>0,1,2,3->4
>
>You see, we now need 4 times the record count we had before (~4n).
>
>ASSP limits the word count per file to 600 - let's say we have an avg. of
>400 words in 30.000 files:
>
>400*30.000*4 = 48.000.000 records in a database
>
>The flat model still in use will need ~12.000.000 records in the worst
>case (no sequence occurs more than once).
>Realistically it is ~6.000.000 records in ASSP. This is approx. 20 times
>the Bayes spamdb.
>
>Not really much, you think? If we held all records in RAM, we would have
>to do this in each worker (*10 !!).
>Perl needs ~100 bytes to hold one record in memory:
>
>6.000.000 * 10 * 100 = 6.000.000.000 bytes -> ~6GB
>
>With some 'hyper-logic' it is possible to reduce the record count to 50% -
>but this will cost runtime.
>
>I think we've reached the 2GB even with 50%!
>
>Thomas
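The record counts in the example above can be checked with a minimal Python sketch. This is not ASSP's Perl code: the function names `flat_records` and `full_records` are hypothetical and chosen only for illustration. The figures for average words, file count, worker count and bytes per record are the estimates given in the mail, not measurements.

```python
def flat_records(words, order=4):
    """Flat model: every run of `order` words -> the word that follows it."""
    return [(tuple(words[i:i + order]), words[i + order])
            for i in range(len(words) - order)]


def full_records(words, order=4):
    """'Real HMM' as described in the mail: also store every shorter
    stage (1 word -> next, 2 words -> next, ...) of each sequence.
    Duplicates are kept on purpose, matching the ~4n estimate."""
    records = []
    for context, nxt in flat_records(words, order):
        seq = context + (nxt,)
        for k in range(1, order + 1):
            records.append((seq[:k], seq[k]))
    return records


words = list(range(10))                 # the ten words 0 ..... 9
flat_count = len(flat_records(words))   # n - 4 records
full_count = len(full_records(words))   # 4 * (n - 4) records

# Back-of-the-envelope figures taken from the mail (assumptions, not measured).
avg_words = 400           # assumed average words per file
files = 30_000            # spam + notspam + both error locations, rounded
workers = 10              # each worker would hold its own copy in RAM
bytes_per_record = 100    # rough Perl in-memory cost per record

flat_worst_case = avg_words * files      # ~n records per mail
full_worst_case = flat_worst_case * 4    # ~4n records per mail
realistic_flat = 6_000_000               # estimate given in the mail
ram_bytes = realistic_flat * workers * bytes_per_record

print(flat_count, full_count)
print(flat_worst_case, full_worst_case, ram_bytes)
```

The sketch approximates n-4 by n for the per-mail count, as the mail does, so the worst-case totals are upper bounds.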
----------------------------------------------------------------
Thanks Thomas,

The HMM maths makes clear sense now. I do appreciate the logic, even though it means I might need to allow more allocated memory on the mail side of my server :-)

Many thanks for the time you just spent to explain this!

Peter

_______________________________________________
Assp-test mailing list
Assp-test@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/assp-test