Hi Kenneth,

Thanks for your prompt reply.
Yes, these tokens are from a single user, but I am planning to use this
user as a global user that will be managed by admins.
I trained all spam with the same --username.
I changed the fillfactor to 90 after the training, not at the beginning,
but this did not solve the problem.

Algorithm graham burton
Tokenizer chain

What do you suggest for the number of training ham/spam mails?
Are 2K mails enough? I trained dspam with the TEFT option. After the
training I switched to TOE in dspam.conf.
I would like to reduce the database size (currently 600MB) without losing
spam catch rate.
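
For what it is worth, here is a rough back-of-the-envelope estimate of how much the database might shrink if the tokens were pruned to about 300k, the figure you mentioned. This is only a sketch: it assumes that nearly all of the 600MB belongs to dspam_token_data and its indexes, that size scales linearly with the row count, and that "600MB" means MiB. The helper names are made up for illustration.

```python
# Figures taken from this thread; everything else is an assumption.
DB_SIZE_BYTES = 600 * 1024 * 1024  # "currently 600MB", read here as MiB
TOKEN_ROWS = 7_075_311             # count(*) from dspam_token_data
TARGET_ROWS = 300_000              # pruned corpus size mentioned by Ken


def bytes_per_row(total_bytes, rows):
    """Average on-disk cost per token row, indexes included (assumed)."""
    return total_bytes / rows


def projected_size_mib(total_bytes, rows, target_rows):
    """Linear projection of the database size after pruning to target_rows."""
    return bytes_per_row(total_bytes, rows) * target_rows / (1024 * 1024)


per_row = bytes_per_row(DB_SIZE_BYTES, TOKEN_ROWS)
projected = projected_size_mib(DB_SIZE_BYTES, TOKEN_ROWS, TARGET_ROWS)
print(f"about {per_row:.0f} bytes per token row")
print(f"about {projected:.0f} MiB after pruning to {TARGET_ROWS} rows")
```

The real saving depends on how the table and indexes are rewritten after pruning, so this is an order-of-magnitude guide only.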


Here is the debug log. As you can see, in the first excerpt there is a
22 second delay between the "pgsql query..." line and the first BNR
pattern line (9 and 40 seconds in the other two excerpts).
It seems dspam spends this time in the database query.


Tue Mar 29 15:15:25 2011  1112: [03/29/2011 15:15:25] Processing body
token 'visit'
Tue Mar 29 15:15:25 2011  1112: [03/29/2011 15:15:25] Finished
tokenizing (ngram) message
Tue Mar 29 15:15:25 2011  1112: [03/29/2011 15:15:25] pgsql query length: 11051
Tue Mar 29 15:15:25 2011
Tue Mar 29 15:15:47 2011  1112: [03/29/2011 15:15:47] BNR pattern
instantiated: 'bnr.s|0.00_0.00_0.05_'
Tue Mar 29 15:15:47 2011  1112: [03/29/2011 15:15:47] BNR pattern
instantiated: 'bnr.s|0.00_0.05_0.30_'
Tue Mar 29 15:15:47 2011  1112: [03/29/2011 15:15:47] BNR pattern
instantiated: 'bnr.s|0.05_0.30_0.10_'


Tue Mar 29 15:23:32 2011  1112: [03/29/2011 15:23:32] Finished
tokenizing (ngram) message
Tue Mar 29 15:23:32 2011  1112: [03/29/2011 15:23:32] pgsql query length: 11023
Tue Mar 29 15:23:32 2011
Tue Mar 29 15:23:41 2011  1112: [03/29/2011 15:23:41] BNR pattern
instantiated: 'bnr.s|0.00_0.00_0.05_'
Tue Mar 29 15:23:41 2011  1112: [03/29/2011 15:23:41] BNR pattern
instantiated: 'bnr.s|0.00_0.05_0.30_'



Tue Mar 29 15:35:08 2011  1112: [03/29/2011 15:35:08] Processing body
token 'org"'
Tue Mar 29 15:35:08 2011  1112: [03/29/2011 15:35:08] Finished
tokenizing (ngram) message
Tue Mar 29 15:35:08 2011  1112: [03/29/2011 15:35:08] pgsql query length: 28271
Tue Mar 29 15:35:08 2011
Tue Mar 29 15:35:48 2011  1112: [03/29/2011 15:35:48] BNR pattern
instantiated: 'bnr.s|0.00_0.00_0.50_'
Tue Mar 29 15:35:48 2011  1112: [03/29/2011 15:35:48] BNR pattern
instantiated: 'bnr.s|0.00_0.50_0.10_'
Tue Mar 29 15:35:48 2011  1112: [03/29/2011 15:35:48] BNR pattern
instantiated: 'bnr.s|0.50_0.10_0.15_'
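
The delays in the excerpts above can be measured with a small script instead of by eye. The following is a minimal sketch, not part of dspam: it assumes the debug log keeps the bracketed timestamp format shown above, and the function name `query_delays` is hypothetical. It pairs each "pgsql query length" line with the next "BNR pattern" line and reports the gap in seconds.

```python
import re
from datetime import datetime

# Bracketed timestamp as it appears in the debug lines above,
# e.g. "[03/29/2011 15:15:25]".
STAMP = re.compile(r"\[(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\]")
QUERY = re.compile(r"pgsql query length: (\d+)")


def query_delays(log_text):
    """Return (query_length, seconds) for each query/BNR pair in the log."""
    delays = []
    pending = None  # (timestamp, query length) of the last unmatched query
    for line in log_text.splitlines():
        stamp = STAMP.search(line)
        if stamp is None:
            continue  # wrapped continuation line or bare date line
        when = datetime.strptime(stamp.group(1), "%m/%d/%Y %H:%M:%S")
        query = QUERY.search(line)
        if query:
            pending = (when, int(query.group(1)))
        elif pending and "BNR pattern" in line:
            started, length = pending
            delays.append((length, (when - started).total_seconds()))
            pending = None  # only the first BNR line after a query counts
    return delays


# Lines copied from the excerpts above (wrapped the same way as in the mail).
SAMPLE = """\
Tue Mar 29 15:15:25 2011  1112: [03/29/2011 15:15:25] pgsql query length: 11051
Tue Mar 29 15:15:25 2011
Tue Mar 29 15:15:47 2011  1112: [03/29/2011 15:15:47] BNR pattern
instantiated: 'bnr.s|0.00_0.00_0.05_'
Tue Mar 29 15:23:32 2011  1112: [03/29/2011 15:23:32] pgsql query length: 11023
Tue Mar 29 15:23:41 2011  1112: [03/29/2011 15:23:41] BNR pattern
Tue Mar 29 15:35:08 2011  1112: [03/29/2011 15:35:08] pgsql query length: 28271
Tue Mar 29 15:35:48 2011  1112: [03/29/2011 15:35:48] BNR pattern
"""

for length, seconds in query_delays(SAMPLE):
    print(f"query length {length}: {seconds:.0f} s before first BNR pattern")
```

Run against a full debug log, this would show whether the delay shrinks as the server warms up, which is what I see when processing time drops after startup.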



Thanks.


On Tue, Mar 29, 2011 at 4:28 PM, Kenneth Marshall <k...@rice.edu> wrote:
> On Tue, Mar 29, 2011 at 11:45:39AM +0300, Ibrahim Harrani wrote:
>> Hi,
>>
>> I am testing git version of dspam with PostgreSQL 9.0 running on
>> FreeBSD 8 (Dual core cpu, 4 GB memory)
>>
>> I trained dspam with 110K spam and 50K ham mails. Now I have more than
>> 7 million entries in dspam.
>>
>> dspam=# SELECT count(*) from dspam_token_data ;
>>   count
>> ---------
>>  7075311
>> (1 row)
>>
>> I vacuum and reindex the database regularly.
>>
>> When I start dspam, processing an email takes 40-50 sec at the
>> beginning, then drops to 10 sec.
>> If I run this test on a more powerful server (quad core cpu with 16GB
>> memory), it takes 0.01 secs.
>> I believe that the problem on the small server is the large number of
>> database entries, but I would like to get better performance
>> on the small server as well. Any ideas?
>>
>> Do you think that sqlite might be better than pgsql on this setup? Or
>> did I train dspam with too much spam/ham?
>>
>> Thanks.
>>
>
> Hi Ibrahim,
>
> Are these 7 million tokens for a single user? What tokenizer are you
> using: WORD, CHAIN, MARKOV/OSB, MARKOV/SBPH? That seems like an awful
> lot of training. The docs usually recommend 2k messages each of ham
> and spam. When we generated a base corpus for our user community,
> we pruned the resulting millions of tokens down to about 300k. Another
> thing that can help is to cluster your data on the uid+token index.
> It looks like you cannot keep the full active token pages in memory
> with only a 4GB system. Look at your paging/swapping stats. You may
> be able to reduce your memory footprint which should help your performance.
> Do you have your FILL FACTOR set to allow HOT updates?
>
> Cheers,
> Ken
>

_______________________________________________
Dspam-user mailing list
Dspam-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-user
