-------- Original Message --------
> Date: Thu, 30 Jul 2009 19:01:37 +0200
> From: "Sebastian Toepfer" <[email protected]>
> To: [email protected]
> Subject: Re: [Dspam-user] Upgrade dspam 3.6.8 to 3.9.0-git

> Hello Steve,
> 
Hello Sebastian,


> thanks, my holiday is rescued :)
>
Why? What did I write that was good enough to rescue your holiday?


> >> Hello,
> >>
> > Hallo Sebastian,
> >
> >
> >> sorry for my poor english :(
> >>
> > No problem. We understand you. If you have issues with English then write
> > in German (some of us will understand you). But if you can, please continue
> > in English.
> >
> >
> >> I'll upgrade my dspam-3.6.8 from debian(etch) to dspam-3.9.0-git (self
> >> compiled? Never fear! Not on solaris this time :)). But I have a few ??
> >> over my head.
> >>
> >> first:
> >> steve wrote: "You know btw that DSPAM out of the box allows you to create
> >> corpi on a per user basis?"
> >>
> >> How? Where I can find the documentation about this feature?
> >>
> > See README in the root directory of the source, chapter "2.5 DSPAM USER 
> > PREFERENCES" . This is a short abstract of the preference for creating 
> > corpi:
> > ------
> > makeCorpus { on | off }
> > When activated, a maildir-style corpus is maintained in the user's data
> > directory (DSPAM_HOME/DATA/USERNAME), suitable for future retraining or
> > other analysis. (default:off)
> > ------
> >
> >
> >> Can I use this
> >> for backups (see next question)?
> >>
> > Yes. You could misuse that functionality as a sort of backup.
> >
> >
> >> second:
> >> Is it possible with this method to create corpi from the old installation
> >> and train the new one with them,
> >>
> > No. You can't create a corpus from an old installation. Enabling
> > "makeCorpus" only affects mail that arrives after you activate
> > that preference. Mail that has already been delivered will not be
> > affected by that option.
> >
> 
> is nothing for me :(
> 
Okay.


> >
> >> to change the tokenizer without retraining for the
> >> users. Because I use dspam at home and the "users" have trained dspam for
> >> about (3) years, and they will kill me if they must do this again :(
> >>
> > If I understand that right, you are asking whether you could shorten the
> > training for the new installation by using old data. Right? Yes! You can
> > do that. You could dump or copy the old data and import it on the new
> > installation. But if I see that right, you are planning to change the
> > tokenizer, and changing the tokenizer mostly means that old data is useless.
> >
> 
> bad news ... I've read that other tokenizers are better,
>
Better at what? If it were that clear which tokenizer is the best, we would
probably remove all the others. But it's not that easy. For some setups
tokenizer A is better than tokenizer B, and so on...


> why is it not
> possible to migrate the data from one tokenizer to another? Is it a problem
> with how dspam creates these tokens - is it only one way?
> 
Yep. The reason is simple:
1) Not all tokenizers use the same schema/pattern
2) No chain (word order) information is saved inside the token
3) Computing a token from plain text is easy, but the way back is hard


I am now going to explain in depth how the tokenizers create the
tokens/patterns. I do that because I hope new users will search the mailing list
archives and stop asking the same question over and over. I will show only the
token generation part. Internally DSPAM uses algorithms for calculating the
probability and the confidence factor; I am not going to explain those two
parts. Besides creating the tokens, DSPAM weights the generated tokens
differently depending on which tokenizer is used. I am not going to explain
that either, since I have already done so in the past. If you need that info,
please search the mailing list archives and read more about it there.


So now the technical mumbo jumbo. Let me explain:
--------------------------------------------------
Topic 1:
==================================================
Tokenizer WORD breaks the text up into single words. For example, the text:
"Heute Abend war ich mit meiner Freundin im Kino und habe viel gelacht."
(German for "This evening I was at the cinema with my girlfriend and laughed a
lot.") would be broken up into:
1) Heute
2) Abend
3) war
4) ich
5) mit
6) meiner
7) Freundin
8) im
9) Kino
10) und
11) habe
12) viel
13) gelacht

And then DSPAM would create the tokens for each word:
TOKEN: 'Heute' CRC: 6716984897371635712
TOKEN: 'Abend' CRC: 6670531613365895168
TOKEN: 'war' CRC: 4772677679197454336
TOKEN: 'ich' CRC: 6329956816985784320
TOKEN: 'mit' CRC: 5158417007107899392
TOKEN: 'meiner' CRC: 4773009072114954240
TOKEN: 'Freundin' CRC: 13580161102417572361
TOKEN: 'im' CRC: 5811385145726337024
TOKEN: 'Kino' CRC: 6035516550826426368
TOKEN: 'und' CRC: 6670506629311496192
TOKEN: 'habe' CRC: 6712962585043402752
TOKEN: 'viel' CRC: 5844870173739188224
TOKEN: 'gelacht' CRC: 5158829993465032208
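
For readers who prefer code, here is a minimal Python sketch of the WORD
splitting step only. It is an illustration, not DSPAM's code: the helper name
word_tokens is made up, and splitting on every non-word character is a
simplifying assumption (DSPAM's real delimiter handling is more involved). The
CRC step is left out on purpose.

```python
import re

def word_tokens(text):
    # Assumption: treat every run of non-word characters as a delimiter.
    # DSPAM's real delimiter set and header handling are more involved.
    return [w for w in re.split(r"\W+", text) if w]

sentence = "Heute Abend war ich mit meiner Freundin im Kino und habe viel gelacht."
tokens = word_tokens(sentence)

for number, token in enumerate(tokens, start=1):
    print(f"{number}) {token}")
```

Each printed word corresponds to one entry of the numbered list above; DSPAM
then stores a numeric value per word instead of the word itself.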


Tokenizer CHAIN would break up the same mail into (+ = combine words):
1) Heute+Abend
2) Abend+war
3) war+ich
4) ich+mit
5) mit+meiner
6) meiner+Freundin
7) Freundin+im
8) im+Kino
9) Kino+und
10) und+habe
11) habe+viel
12) viel+gelacht

And then DSPAM would create the tokens for each chain:
TOKEN: 'Heute+Abend' CRC: 9299536586222406967
TOKEN: 'Abend+war' CRC: 5205867775940263209
TOKEN: 'war+ich' CRC: 6329956649787979024
TOKEN: 'ich+mit' CRC: 5158416839735805488
TOKEN: 'mit+meiner' CRC: 9567822050683308311
TOKEN: 'meiner+Freundin' CRC: 11339548565549479056
TOKEN: 'Freundin+im' CRC: 7816109150855533158
TOKEN: 'im+Kino' CRC: 6035516551245899312
TOKEN: 'Kino+und' CRC: 3139684354012378707
TOKEN: 'und+habe' CRC: 2029218973535212134
TOKEN: 'habe+viel' CRC: 15552379170419714363
TOKEN: 'viel+gelacht' CRC: 5059261385542544937
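
The CHAIN pairing can be sketched in a few lines of Python. This is only an
illustration that reproduces the twelve pairs listed above; the function name
chain_tokens is hypothetical and the CRC step is again left out.

```python
def chain_tokens(words):
    # Join every word with its direct successor, using "+" as in the listing.
    return [f"{left}+{right}" for left, right in zip(words, words[1:])]

words = ("Heute Abend war ich mit meiner Freundin im Kino "
         "und habe viel gelacht").split()
chains = chain_tokens(words)

for number, chain in enumerate(chains, start=1):
    print(f"{number}) {chain}")
```

A text of n words gives n-1 pairs, which is why 13 words produce 12 chains.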


Tokenizer OSB would break up the same mail into (+ = combine words, # = <skip>):
1) Kino+#+#+#+gelacht
2) und+#+#+gelacht
3) habe+#+gelacht
4) viel+gelacht
5) im+#+#+#+viel
6) Kino+#+#+viel
7) und+#+viel
8) habe+viel
9) Freundin+#+#+#+habe
10) im+#+#+habe
11) Kino+#+habe
12) und+habe
13) meiner+#+#+#+und
14) Freundin+#+#+und
15) im+#+und
16) Kino+und
17) mit+#+#+#+Kino
18) meiner+#+#+Kino
19) Freundin+#+Kino
20) im+Kino
21) ich+#+#+#+im
22) mit+#+#+im
23) meiner+#+im
24) Freundin+im
25) war+#+#+#+Freundin
26) ich+#+#+Freundin
27) mit+#+Freundin
28) meiner+Freundin
29) Abend+#+#+#+meiner
30) war+#+#+meiner
31) ich+#+meiner
32) mit+meiner
33) Heute+#+#+#+mit
34) Abend+#+#+mit
35) war+#+mit
36) ich+mit

And then DSPAM would create the tokens for each pattern:
TOKEN: 'Kino+#+#+#+gelacht' CRC: 3148349109242633294
TOKEN: 'und+#+#+gelacht' CRC: 2006833870839550408
TOKEN: 'habe+#+gelacht' CRC: 2006883881861244457
TOKEN: 'viel+gelacht' CRC: 5059261385542544937
TOKEN: 'im+#+#+#+viel' CRC: 16100764786021230948
TOKEN: 'Kino+#+#+viel' CRC: 16082144607504427364
TOKEN: 'und+#+viel' CRC: 10458140588311374092
TOKEN: 'habe+viel' CRC: 15552379170419714363
TOKEN: 'Freundin+#+#+#+habe' CRC: 1991158605521709403
TOKEN: 'im+#+#+habe' CRC: 15211418373216069988
TOKEN: 'Kino+#+habe' CRC: 16865398141328091395
TOKEN: 'und+habe' CRC: 2029218973535212134
TOKEN: 'meiner+#+#+#+und' CRC: 14982435885105910831
TOKEN: 'Freundin+#+#+und' CRC: 17912671458991389317
TOKEN: 'im+#+und' CRC: 8183715938297249958
TOKEN: 'Kino+und' CRC: 3139684354012378707
TOKEN: 'mit+#+#+#+Kino' CRC: 15973767036771659876
TOKEN: 'meiner+#+#+Kino' CRC: 15990647948548780029
TOKEN: 'Freundin+#+Kino' CRC: 1671950834524128732
TOKEN: 'im+Kino' CRC: 6035516551245899312
TOKEN: 'ich+#+#+#+im' CRC: 16038087078191622500
TOKEN: 'mit+#+#+im' CRC: 10834693566633404695
TOKEN: 'meiner+#+im' CRC: 1465587418199637282
TOKEN: 'Freundin+im' CRC: 7816109150855533158
TOKEN: 'war+#+#+#+Freundin' CRC: 17493766208078576673
TOKEN: 'ich+#+#+Freundin' CRC: 5758453548536397908
TOKEN: 'mit+#+Freundin' CRC: 11320398811460250377
TOKEN: 'meiner+Freundin' CRC: 11339548565549479056
TOKEN: 'Abend+#+#+#+meiner' CRC: 8544044731047037263
TOKEN: 'war+#+#+meiner' CRC: 14722667808637756004
TOKEN: 'ich+#+meiner' CRC: 14702440976645933412
TOKEN: 'mit+meiner' CRC: 9567822050683308311
TOKEN: 'Heute+#+#+#+mit' CRC: 2006452661602586241
TOKEN: 'Abend+#+#+mit' CRC: 5482652074219693289
TOKEN: 'war+#+mit' CRC: 15707817493435847227
TOKEN: 'ich+mit' CRC: 5158416839735805488

NOTE for OSB: Technically the above is not 100% correct, since DSPAM starts from
the beginning and not from the end of the mail. I was too lazy to change my
OpenOffice.org Calc sheet to reflect that. But I think you all get the idea of
how the tokens are generated.
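
The OSB (orthogonal sparse bigram) listing can be reproduced with a small
Python sketch. Assumptions: a window of five words, walking from the end of
the mail exactly like the spreadsheet listing (so not the order DSPAM itself
uses), and only words that have four predecessors are used as the right-hand
side. The name osb_patterns is made up for this example.

```python
def osb_patterns(words, window=5):
    patterns = []
    # Walk backwards over every word that has window-1 predecessors.
    for end in range(len(words) - 1, window - 2, -1):
        # Pair the word with each predecessor, farthest one first.
        for distance in range(window - 1, 0, -1):
            skipped = ["#"] * (distance - 1)  # one "#" per skipped word
            parts = [words[end - distance]] + skipped + [words[end]]
            patterns.append("+".join(parts))
    return patterns

words = ("Heute Abend war ich mit meiner Freundin im Kino "
         "und habe viel gelacht").split()
patterns = osb_patterns(words)

for number, pattern in enumerate(patterns, start=1):
    print(f"{number}) {pattern}")
```

Nine words have four predecessors, and each contributes four patterns, which
gives the 36 entries of the listing.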


Tokenizer SBPH would break up the same mail into (+ = combine words, # = <skip>):
1) gelacht
2) viel+gelacht
3) habe+#+gelacht
4) habe+viel+gelacht
5) und+#+#+gelacht
6) und+#+viel+gelacht
7) und+habe+#+gelacht
8) und+habe+viel+gelacht
9) Kino+#+#+#+gelacht
10) Kino+#+#+viel+gelacht
11) Kino+#+habe+#+gelacht
12) Kino+#+habe+viel+gelacht
13) Kino+und+#+#+gelacht
14) Kino+und+#+viel+gelacht
15) Kino+und+habe+#+gelacht
16) Kino+und+habe+viel+gelacht
17) viel
18) habe+viel
19) und+#+viel
20) und+habe+viel
21) Kino+#+#+viel
22) Kino+#+habe+viel
23) Kino+und+#+viel
24) Kino+und+habe+viel
25) im+#+#+#+viel
26) im+#+#+habe+viel
27) im+#+und+#+viel
28) im+#+und+habe+viel
29) im+Kino+#+#+viel
30) im+Kino+#+habe+viel
31) im+Kino+und+#+viel
32) im+Kino+und+habe+viel
33) habe
34) und+habe
35) Kino+#+habe
36) Kino+und+habe
37) im+#+#+habe
38) im+#+und+habe
39) im+Kino+#+habe
40) im+Kino+und+habe
41) Freundin+#+#+#+habe
42) Freundin+#+#+und+habe
43) Freundin+#+Kino+#+habe
44) Freundin+#+Kino+und+habe
45) Freundin+im+#+#+habe
46) Freundin+im+#+und+habe
47) Freundin+im+Kino+#+habe
48) Freundin+im+Kino+und+habe
49) und
50) Kino+und
51) im+#+und
52) im+Kino+und
53) Freundin+#+#+und
54) Freundin+#+Kino+und
55) Freundin+im+#+und
56) Freundin+im+Kino+und
57) meiner+#+#+#+und
58) meiner+#+#+Kino+und
59) meiner+#+im+#+und
60) meiner+#+im+Kino+und
61) meiner+Freundin+#+#+und
62) meiner+Freundin+#+Kino+und
63) meiner+Freundin+im+#+und
64) meiner+Freundin+im+Kino+und
65) Kino
66) im+Kino
67) Freundin+#+Kino
68) Freundin+im+Kino
69) meiner+#+#+Kino
70) meiner+#+im+Kino
71) meiner+Freundin+#+Kino
72) meiner+Freundin+im+Kino
73) mit+#+#+#+Kino
74) mit+#+#+im+Kino
75) mit+#+Freundin+#+Kino
76) mit+#+Freundin+im+Kino
77) mit+meiner+#+#+Kino
78) mit+meiner+#+im+Kino
79) mit+meiner+Freundin+#+Kino
80) mit+meiner+Freundin+im+Kino
81) im
82) Freundin+im
83) meiner+#+im
84) meiner+Freundin+im
85) mit+#+#+im
86) mit+#+Freundin+im
87) mit+meiner+#+im
88) mit+meiner+Freundin+im
89) ich+#+#+#+im
90) ich+#+#+Freundin+im
91) ich+#+meiner+#+im
92) ich+#+meiner+Freundin+im
93) ich+mit+#+#+im
94) ich+mit+#+Freundin+im
95) ich+mit+meiner+#+im
96) ich+mit+meiner+Freundin+im
97) Freundin
98) meiner+Freundin
99) mit+#+Freundin
100) mit+meiner+Freundin
101) ich+#+#+Freundin
102) ich+#+meiner+Freundin
103) ich+mit+#+Freundin
104) ich+mit+meiner+Freundin
105) war+#+#+#+Freundin
106) war+#+#+meiner+Freundin
107) war+#+mit+#+Freundin
108) war+#+mit+meiner+Freundin
109) war+ich+#+#+Freundin
110) war+ich+#+meiner+Freundin
111) war+ich+mit+#+Freundin
112) war+ich+mit+meiner+Freundin
113) meiner
114) mit+meiner
115) ich+#+meiner
116) ich+mit+meiner
117) war+#+#+meiner
118) war+#+mit+meiner
119) war+ich+#+meiner
120) war+ich+mit+meiner
121) Abend+#+#+#+meiner
122) Abend+#+#+mit+meiner
123) Abend+#+ich+#+meiner
124) Abend+#+ich+mit+meiner
125) Abend+war+#+#+meiner
126) Abend+war+#+mit+meiner
127) Abend+war+ich+#+meiner
128) Abend+war+ich+mit+meiner
129) mit
130) ich+mit
131) war+#+mit
132) war+ich+mit
133) Abend+#+#+mit
134) Abend+#+ich+mit
135) Abend+war+#+mit
136) Abend+war+ich+mit
137) Heute+#+#+#+mit
138) Heute+#+#+ich+mit
139) Heute+#+war+#+mit
140) Heute+#+war+ich+mit
141) Heute+Abend+#+#+mit
142) Heute+Abend+#+ich+mit
143) Heute+Abend+war+#+mit
144) Heute+Abend+war+ich+mit

And then DSPAM would create the tokens for each pattern:
TOKEN: 'gelacht' CRC: 5158829993465032208
TOKEN: 'viel+gelacht' CRC: 5059261385542544937
TOKEN: 'habe+#+gelacht' CRC: 2006883881861244457
TOKEN: 'habe+viel+gelacht' CRC: 17018992409010758715
TOKEN: 'und+#+#+gelacht' CRC: 2006833870839550408
TOKEN: 'und+#+viel+gelacht' CRC: 11390176152559640594
TOKEN: 'und+habe+#+gelacht' CRC: 14122710027983374098
TOKEN: 'und+habe+viel+gelacht' CRC: 17012358470641669179
TOKEN: 'Kino+#+#+#+gelacht' CRC: 3148349109242633294
TOKEN: 'Kino+#+#+viel+gelacht' CRC: 2465761806596665413
TOKEN: 'Kino+#+habe+#+gelacht' CRC: 9212073874585409349
TOKEN: 'Kino+#+habe+viel+gelacht' CRC: 2441929876380959827
TOKEN: 'Kino+und+#+#+gelacht' CRC: 3884236675463582185
TOKEN: 'Kino+und+#+viel+gelacht' CRC: 9588796112820570248
TOKEN: 'Kino+und+habe+#+gelacht' CRC: 15635871664798674824
TOKEN: 'Kino+und+habe+viel+gelacht' CRC: 15730866926752352773
TOKEN: 'viel' CRC: 5844870173739188224
TOKEN: 'habe+viel' CRC: 15552379170419714363
TOKEN: 'und+#+viel' CRC: 10458140588311374092
TOKEN: 'und+habe+viel' CRC: 15561095549081626939
TOKEN: 'Kino+#+#+viel' CRC: 16082144607504427364
TOKEN: 'Kino+#+habe+viel' CRC: 4459771146848046574
TOKEN: 'Kino+und+#+viel' CRC: 10458174477295434923
TOKEN: 'Kino+und+habe+viel' CRC: 16689363735248968540
TOKEN: 'im+#+#+#+viel' CRC: 16100764786021230948
TOKEN: 'im+#+#+habe+viel' CRC: 361495487179856336
TOKEN: 'im+#+und+#+viel' CRC: 10458174279073923713
TOKEN: 'im+#+und+habe+viel' CRC: 18048631823991129461
TOKEN: 'im+Kino+#+#+viel' CRC: 999589455442663823
TOKEN: 'im+Kino+#+habe+viel' CRC: 754854007182855662
TOKEN: 'im+Kino+und+#+viel' CRC: 13596196537167264906
TOKEN: 'im+Kino+und+habe+viel' CRC: 16689408353592404828
TOKEN: 'habe' CRC: 6712962585043402752
TOKEN: 'und+habe' CRC: 2029218973535212134
TOKEN: 'Kino+#+habe' CRC: 16865398141328091395
TOKEN: 'Kino+und+habe' CRC: 4403215055096353382
TOKEN: 'im+#+#+habe' CRC: 15211418373216069988
TOKEN: 'im+#+und+habe' CRC: 4415064577369281126
TOKEN: 'im+Kino+#+habe' CRC: 16865398282259489795
TOKEN: 'im+Kino+und+habe' CRC: 17288013589001183885
TOKEN: 'Freundin+#+#+#+habe' CRC: 1991158605521709403
TOKEN: 'Freundin+#+#+und+habe' CRC: 16528542051568490135
TOKEN: 'Freundin+#+Kino+#+habe' CRC: 8243217039978347783
TOKEN: 'Freundin+#+Kino+und+habe' CRC: 10726735021174036825
TOKEN: 'Freundin+im+#+#+habe' CRC: 17816582324937038052
TOKEN: 'Freundin+im+#+und+habe' CRC: 11902259067882653282
TOKEN: 'Freundin+im+Kino+#+habe' CRC: 17029665200395531479
TOKEN: 'Freundin+im+Kino+und+habe' CRC: 10971501632817134186
TOKEN: 'und' CRC: 6670506629311496192
TOKEN: 'Kino+und' CRC: 3139684354012378707
TOKEN: 'im+#+und' CRC: 8183715938297249958
TOKEN: 'im+Kino+und' CRC: 766330759622587987
TOKEN: 'Freundin+#+#+und' CRC: 17912671458991389317
TOKEN: 'Freundin+#+Kino+und' CRC: 13986308648741369452
TOKEN: 'Freundin+im+#+und' CRC: 91869388660448376
TOKEN: 'Freundin+im+Kino+und' CRC: 3385135257430165459
TOKEN: 'meiner+#+#+#+und' CRC: 14982435885105910831
TOKEN: 'meiner+#+#+Kino+und' CRC: 4226515320529629590
TOKEN: 'meiner+#+im+#+und' CRC: 16253799335637820502
TOKEN: 'meiner+#+im+Kino+und' CRC: 13394501699007962644
TOKEN: 'meiner+Freundin+#+#+und' CRC: 16293319120595454954
TOKEN: 'meiner+Freundin+#+Kino+und' CRC: 13393065477179921708
TOKEN: 'meiner+Freundin+im+#+und' CRC: 5966400137366783433
TOKEN: 'meiner+Freundin+im+Kino+und' CRC: 4792303306188615536
TOKEN: 'Kino' CRC: 6035516550826426368
TOKEN: 'im+Kino' CRC: 6035516551245899312
TOKEN: 'Freundin+#+Kino' CRC: 1671950834524128732
TOKEN: 'Freundin+im+Kino' CRC: 17854107114517854309
TOKEN: 'meiner+#+#+Kino' CRC: 15990647948548780029
TOKEN: 'meiner+#+im+Kino' CRC: 13888078793784186575
TOKEN: 'meiner+Freundin+#+Kino' CRC: 3232270807921846989
TOKEN: 'meiner+Freundin+im+Kino' CRC: 17099445887404722442
TOKEN: 'mit+#+#+#+Kino' CRC: 15973767036771659876
TOKEN: 'mit+#+#+im+Kino' CRC: 1120299120353238160
TOKEN: 'mit+#+Freundin+#+Kino' CRC: 5050762065960351229
TOKEN: 'mit+#+Freundin+im+Kino' CRC: 13846215677765703509
TOKEN: 'mit+meiner+#+#+Kino' CRC: 16343570441473109980
TOKEN: 'mit+meiner+#+im+Kino' CRC: 17812691054866224847
TOKEN: 'mit+meiner+Freundin+#+Kino' CRC: 1086065226225016130
TOKEN: 'mit+meiner+Freundin+im+Kino' CRC: 4395133216194152765
TOKEN: 'im' CRC: 5811385145726337024
TOKEN: 'Freundin+im' CRC: 7816109150855533158
TOKEN: 'meiner+#+im' CRC: 1465587418199637282
TOKEN: 'meiner+Freundin+im' CRC: 17710698886775658306
TOKEN: 'mit+#+#+im' CRC: 10834693566633404695
TOKEN: 'mit+#+Freundin+im' CRC: 18111038972047310256
TOKEN: 'mit+meiner+#+im' CRC: 1465587275890214843
TOKEN: 'mit+meiner+Freundin+im' CRC: 17710754364967909490
TOKEN: 'ich+#+#+#+im' CRC: 16038087078191622500
TOKEN: 'ich+#+#+Freundin+im' CRC: 5138177127977516584
TOKEN: 'ich+#+meiner+#+im' CRC: 7598768687643625872
TOKEN: 'ich+#+meiner+Freundin+im' CRC: 13848636115401072991
TOKEN: 'ich+mit+#+#+im' CRC: 10834729762232558615
TOKEN: 'ich+mit+#+Freundin+im' CRC: 18113705072675278256
TOKEN: 'ich+mit+meiner+#+im' CRC: 4566882011436074906
TOKEN: 'ich+mit+meiner+Freundin+im' CRC: 15958778763309502441
TOKEN: 'Freundin' CRC: 13580161102417572361
TOKEN: 'meiner+Freundin' CRC: 11339548565549479056
TOKEN: 'mit+#+Freundin' CRC: 11320398811460250377
TOKEN: 'mit+meiner+Freundin' CRC: 11701509035464193201
TOKEN: 'ich+#+#+Freundin' CRC: 5758453548536397908
TOKEN: 'ich+#+meiner+Freundin' CRC: 14734124733490463921
TOKEN: 'ich+mit+#+Freundin' CRC: 7712133220574661181
TOKEN: 'ich+mit+meiner+Freundin' CRC: 4279576675461360019
TOKEN: 'war+#+#+#+Freundin' CRC: 17493766208078576673
TOKEN: 'war+#+#+meiner+Freundin' CRC: 13509209483715422404
TOKEN: 'war+#+mit+#+Freundin' CRC: 17942848210230282853
TOKEN: 'war+#+mit+meiner+Freundin' CRC: 459960140544208734
TOKEN: 'war+ich+#+#+Freundin' CRC: 8674439904769966164
TOKEN: 'war+ich+#+meiner+Freundin' CRC: 7528935976228917086
TOKEN: 'war+ich+mit+#+Freundin' CRC: 7712194212032937837
TOKEN: 'war+ich+mit+meiner+Freundin' CRC: 17059398211508890020
TOKEN: 'meiner' CRC: 4773009072114954240
TOKEN: 'mit+meiner' CRC: 9567822050683308311
TOKEN: 'ich+#+meiner' CRC: 14702440976645933412
TOKEN: 'ich+mit+meiner' CRC: 9567785545379955735
TOKEN: 'war+#+#+meiner' CRC: 14722667808637756004
TOKEN: 'war+#+mit+meiner' CRC: 15102709312457837529
TOKEN: 'war+ich+#+meiner' CRC: 11050021678326778794
TOKEN: 'war+ich+mit+meiner' CRC: 16941397287622239551
TOKEN: 'Abend+#+#+#+meiner' CRC: 8544044731047037263
TOKEN: 'Abend+#+#+mit+meiner' CRC: 6700917376176391980
TOKEN: 'Abend+#+ich+#+meiner' CRC: 1454797817133172575
TOKEN: 'Abend+#+ich+mit+meiner' CRC: 16196371955299811851
TOKEN: 'Abend+war+#+#+meiner' CRC: 1470573290570285919
TOKEN: 'Abend+war+#+mit+meiner' CRC: 16574673948525685929
TOKEN: 'Abend+war+ich+#+meiner' CRC: 12595187249194953946
TOKEN: 'Abend+war+ich+mit+meiner' CRC: 7875344944142258172
TOKEN: 'mit' CRC: 5158417007107899392
TOKEN: 'ich+mit' CRC: 5158416839735805488
TOKEN: 'war+#+mit' CRC: 15707817493435847227
TOKEN: 'war+ich+mit' CRC: 6905336139605378569
TOKEN: 'Abend+#+#+mit' CRC: 5482652074219693289
TOKEN: 'Abend+#+ich+mit' CRC: 2006454003823721484
TOKEN: 'Abend+war+#+mit' CRC: 15698522771525150782
TOKEN: 'Abend+war+ich+mit' CRC: 8949741539749834179
TOKEN: 'Heute+#+#+#+mit' CRC: 2006452661602586241
TOKEN: 'Heute+#+#+ich+mit' CRC: 10912094934613969813
TOKEN: 'Heute+#+war+#+mit' CRC: 6155167828760649639
TOKEN: 'Heute+#+war+ich+mit' CRC: 16279494732467846352
TOKEN: 'Heute+Abend+#+#+mit' CRC: 17451034817009114672
TOKEN: 'Heute+Abend+#+ich+mit' CRC: 4079088572591062061
TOKEN: 'Heute+Abend+war+#+mit' CRC: 18059387714294556703
TOKEN: 'Heute+Abend+war+ich+mit' CRC: 11818656812148744564

NOTE for SBPH: The same applies here as for OSB. Technically the above is not
100% correct, since DSPAM starts from the beginning and not from the end of the
mail. I was too lazy to change my OpenOffice.org Calc sheet to reflect that. But
I think you all get the idea of how the tokens are generated.
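
The SBPH (sparse binary polynomial hashing) listing follows a simple rule: for
each word, every one of its four predecessors is either kept or replaced by a
skip, which gives 2^4 = 16 patterns per word. A minimal Python sketch under the
same assumptions as the listing (window of five, walking from the end of the
mail, leading skips dropped); sbph_patterns is a hypothetical name, not a
DSPAM function.

```python
def sbph_patterns(words, window=5):
    patterns = []
    # Walk backwards over every word that has window-1 predecessors.
    for end in range(len(words) - 1, window - 2, -1):
        previous = words[end - (window - 1):end]  # oldest word first
        for mask in range(2 ** (window - 1)):
            slots = []
            for position, word in enumerate(previous):
                bit = window - 2 - position  # oldest word = highest bit
                slots.append(word if mask & (1 << bit) else "#")
            # Skips in front of the first kept word carry no information.
            while slots and slots[0] == "#":
                slots.pop(0)
            patterns.append("+".join(slots + [words[end]]))
    return patterns

words = ("Heute Abend war ich mit meiner Freundin im Kino "
         "und habe viel gelacht").split()
patterns = sbph_patterns(words)

print(len(patterns))
print(patterns[:4])
```

Nine words times 16 combinations gives the 144 patterns above, four times as
many as OSB produces for the same text.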
==================================================


Topic 2:
==================================================
Above you can clearly see (for example in WORD) that the word "Heute" comes
before the word "Abend", that the word "war" comes after "Abend", etc. But once
that data is inside the database you no longer know the order of the words.
The original mail could have been "Der Abend war Heute besonders gut" or "Am
Heutigen Abend war es kalt" or "Am Abend ist es immer dunkel aber Heute war es
hell", etc. You don't know how the words were chained together. While that
information is not important for the WORD tokenizer, it is important for CHAIN,
OSB and SBPH. So transforming from pure WORD to the other tokenizers is not
possible, let alone transforming the CRC "6716984897371635712" back to the
word "Heute". Transforming from the others (CHAIN, OSB or SBPH) to WORD would
be possible, but only if you could easily solve the problem described in
topic 3 (read below).
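
A small Python sketch of why the word order is gone. Two different mails that
use the same words end up with identical WORD data, while their CHAIN tokens
differ. Assumptions: zlib.crc32 is only a stand-in for DSPAM's own token hash,
the two example mails are invented, and the database is modelled as an
unordered collection of values.

```python
import zlib

def stored_values(tokens):
    # Stand-in hash (zlib.crc32), NOT DSPAM's CRC. Sorting models a database
    # that keeps one value per token without any position information.
    return sorted(zlib.crc32(token.encode("utf-8")) for token in tokens)

def chain_tokens(words):
    return [f"{left}+{right}" for left, right in zip(words, words[1:])]

mail_a = "Heute war es hell".split()
mail_b = "es war Heute hell".split()

# Same words, different order: the stored WORD values are identical.
same_word_data = stored_values(mail_a) == stored_values(mail_b)

# The CHAIN tokens of the two mails are different strings.
same_chain_tokens = sorted(chain_tokens(mail_a)) == sorted(chain_tokens(mail_b))

print("WORD data identical:", same_word_data)
print("CHAIN tokens identical:", same_chain_tokens)
```

From the WORD data alone there is no way to tell which of the two mails was
trained, so the CHAIN, OSB or SBPH tokens cannot be derived from it.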
==================================================


Topic 3:
==================================================
By now you should have realized that converting pure tokens (the CRCs) back to
the real word, chain or pattern is a huge task. It would be technically
possible, but it would require either a huge database of precomputed CRCs and
their corresponding WORD, CHAIN, OSB and SBPH patterns, or a very, very, very
fast computer able to brute-force the CRCs. The database would be enormous.
Even if we had every possible combination of the characters that DSPAM does
not filter out of the original mail stream (it drops punctuation such as
.,!?:; and other unwanted characters such as +-#*) up to a word length of,
let's say, 20 characters, a lookup inside such a database (which would
probably be a petabyte or more) would still take a long time. So it
is not really practicable.
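
To get a feeling for the size of such a lookup table, here is a rough
back-of-the-envelope calculation in Python. All numbers are assumptions for
illustration only: an alphabet of 62 characters (a-z, A-Z, 0-9), single words
of up to 20 characters, and 8 bytes per stored CRC (the values shown above fit
into 64 bits). The words themselves and all multi-word patterns are ignored,
so the real table would be even larger.

```python
ALPHABET_SIZE = 62   # assumption: a-z, A-Z and 0-9 only
MAX_WORD_LENGTH = 20
BYTES_PER_CRC = 8    # assumption: 64-bit token values

# Number of distinct strings of length 1..20 over the assumed alphabet.
candidates = sum(ALPHABET_SIZE ** n for n in range(1, MAX_WORD_LENGTH + 1))

# Storage for the CRC column alone, expressed in petabytes (10**15 bytes).
petabytes = candidates * BYTES_PER_CRC / 10**15

# There are far more candidate words than 64-bit values, so many different
# words must share the same value and a reverse lookup would be ambiguous.
more_words_than_values = candidates > 2**64

print(f"candidate words: {candidates:.3e}")
print(f"petabytes for the CRCs alone: {petabytes:.3e}")
print("more words than 64-bit values:", more_words_than_values)
```

Under these assumptions the table is many orders of magnitude beyond a
petabyte, and even a complete table could not give a unique answer for every
CRC.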
==================================================



> > 3 years of data is all fine and okay but to be honest you will not loose
> > much. Just the first days will lead to more training but after a short 
> > time DSPAM will catch up and be very accurate.
> >
> 
> It's a small installation, only ca. 30,000 mails in these 3 years ... and
> 20,000 of them are mine :) .. so I think it will take a year to reach the
> current accuracy.
>
No way. A year? NEVER! Expect a handful of corrections (a two-digit number) and
you will easily be above 90% or even 95%. Just take something like OSB
or CHAIN. Don't go with WORD in your case.


> Or what do you think, how long will it take with this low volume?
> E.g. one user has only 700 ham but 1500 spam (accuracy 91.40% - she loves
> dspam :)).
> 
Not much time. Really. And you could still pretrain a merged or shared,merged
group to speed up the process. You can find spam corpora everywhere on the net
(they are (almost) as common as sand on the beach).


> >
> >> any other pitfalls?
> >>
> > Not really.
> >
> 
> Very good news.
> 
:)


> >
> >> I use dspam with mysql as backend and without groups.
> >>
> > If you have many users then using groups could help to shorten training 
> > time.
> >
> 
> Only 5 users with very different mails. My old solution was a single-user
> spam filter, which resulted in very, very bad accuracy. I found dspam and was
> surprised how well it works (200 or 300 mails and it rocks)! Learning
> by forwarding was another big hit, because we use pop3, and how else should we
> train a filter which runs on a gateway?
> 
Either with the DSPAM Web UI or directly from within the email client. We have
plugins for Mozilla Thunderbird, Lotus Notes and Microsoft Outlook, and
possibly others; just ask here and I am sure someone has made something you
could reuse.



> Sebastian 
> 
Steve

_______________________________________________
Dspam-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspam-user
