>Peter, let's do an example:
>
>I assume: for MaxFiles = 13000 -> 13000 files each in spam and notspam 
>and ~3000 files in each of the two error locations -> 2GB is not enough!
>
>Currently the HMM uses 4 BDBs totalling ~100MB for ~3000 files (in total) 
>on my test system - looks OK?
>
>The logic is flat - the HMM uses only sequences of 4 words followed by one 
>symbol (word), and counts/stores only these. A real HMM would also 
>count/store all the stages below:
>
>3 - 1
>2 - 1
>1 - 1 (1-1 is what Bayes does)
>
>I'll explain this a bit in detail.
>
>Now, for 10 words 0 ... 9, the following is stored in the model:
>
>0,1,2,3->4
>1,2,3,4->5
>2,3,4,5->6
>3,4,5,6->7
>4,5,6,7->8
>5,6,7,8->9
>
>in math: for n words in a mail we need to store n-4 records (~n) for the 
>very simple model
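
A minimal sketch of the flat model described above, assuming a mail is just a list of word tokens. The function name `flat_records` is hypothetical and is not taken from ASSP; it only reproduces the sliding window of 4 words plus the following word.

```python
def flat_records(words, order=4):
    """Return (context, next_word) pairs for every window of `order` words.

    Hypothetical helper for illustration only; not ASSP code.
    """
    return [
        (tuple(words[i:i + order]), words[i + order])
        for i in range(len(words) - order)
    ]


if __name__ == "__main__":
    records = flat_records(list(range(10)))
    for context, nxt in records:
        print(",".join(str(w) for w in context), "->", nxt)
    # One record per window, so n - 4 records for n words.
    print(len(records), "records")
```

For the 10 words 0 ... 9 this yields the six records listed above, which matches n-4.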
>
>Now let's use only the first sequence 0,1,2,3->4 in a real HMM - the HMM 
>would store
>
>0->1
>0,1->2
>0,1,2->3
>0,1,2,3->4
>
>You see, we now need 4 times the record count we had before (~4n).
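
The expansion of one window into its shorter stages can be sketched as follows. This is only an illustration of the counting argument in the mail; `staged_records` is a hypothetical name, and it assumes a window is a list of 5 words (4 context words plus the following one).

```python
def staged_records(window):
    """Expand one window into every prefix -> next-word record.

    Hypothetical helper for illustration only; not ASSP code.
    For [0, 1, 2, 3, 4] this gives 0->1, 0,1->2, 0,1,2->3 and 0,1,2,3->4.
    """
    return [
        (tuple(window[:k]), window[k])
        for k in range(1, len(window))
    ]


if __name__ == "__main__":
    for context, nxt in staged_records([0, 1, 2, 3, 4]):
        print(",".join(str(w) for w in context), "->", nxt)
```

Each window contributes 4 records instead of 1, which is where the factor of ~4n comes from.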
>
>ASSP limits the word count per file to 600 - let's say we have an avg. of 
>400 words in 30.000 files
>
>400*30.000*4 = 48.000.000 records in a database
>
>The flat model still in use will need ~12.000.000 records in the worst case 
>(no sequence occurs more than once).
>Realistic are ~6.000.000 records in ASSP. This is approx. 20 times the 
>Bayes-spamdb.
>
>Not really much, you think? If we held all records in RAM, we 
>would have to do this in each worker (*10 !!).
>Perl needs ~100 bytes to hold one record in memory:
>
>6.000.000 * 10 * 100 = 6.000.000.000 Byte -> ~ 6GB
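
The arithmetic above can be re-run with a short script. All figures are the estimates quoted in this mail (400 words per file, 30.000 files, 10 workers, ~100 bytes per record); the constant and function names are hypothetical, and 2GB is taken as 2 * 10^9 bytes for the comparison.

```python
AVG_WORDS_PER_FILE = 400            # estimate from the mail
FILES = 30_000                      # estimate from the mail
STAGES = 4                          # records per window in a real HMM
WORKERS = 10                        # each worker would hold its own copy
BYTES_PER_RECORD = 100              # rough Perl in-memory cost per record
REALISTIC_FLAT_RECORDS = 6_000_000  # realistic flat-model count from the mail


def ram_bytes(records, workers=WORKERS, bytes_per_record=BYTES_PER_RECORD):
    """RAM needed if every worker keeps all records in memory."""
    return records * workers * bytes_per_record


worst_case_flat = AVG_WORDS_PER_FILE * FILES   # no sequence repeats
full_model = worst_case_flat * STAGES          # real HMM, all stages

if __name__ == "__main__":
    print("flat, worst case:", worst_case_flat, "records")
    print("real HMM:", full_model, "records")
    print("RAM, realistic flat model:", ram_bytes(REALISTIC_FLAT_RECORDS), "bytes")
    print("RAM, record count halved:", ram_bytes(REALISTIC_FLAT_RECORDS // 2), "bytes")
```

Even with the record count halved, the estimate is 3.000.000.000 bytes, which is still above 2GB.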
>
>With some 'hyper-logic' it is possible to reduce the record count to 50% - 
>but this will cost runtime.
>
>I think we've reached the 2GB even with 50%!
>
>Thomas

----------------------------------------------------------------

Thanks Thomas,

The HMM maths makes clear sense now.

I do appreciate the logic, even though it means I might need to
allow more allocated memory on the mail side of my Server :-)

Many thanks for the time you just spent to explain this!

Peter 

