Hi Kenneth,

Thanks for your prompt reply.

Yes, this is from a single user, but I am planning to use this user as a global user that will be managed by admins. I trained all spam with the same --username. I changed the fillfactor to 90 after the training, not at the beginning, but this did not solve the problem.
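For context on the fillfactor point: in PostgreSQL, changing fillfactor with ALTER TABLE only affects pages written afterwards. Rows that were already loaded stay packed as they were until the table is rewritten, for example by CLUSTER or VACUUM FULL. Below is a minimal, untested sketch of applying the setting to existing data. It assumes the stock dspam_token_data table, and the index name is hypothetical; the real name of the (uid, token) index can be looked up with \d dspam_token_data.

```sql
-- Assumption: stock dspam table name; verify with \d dspam_token_data.
ALTER TABLE dspam_token_data SET (fillfactor = 90);

-- Rewrite the table in index order so the new fillfactor applies to existing rows.
-- "dspam_token_data_uid_token_key" is a hypothetical name for the (uid, token) index.
-- CLUSTER holds an exclusive lock on the table while it runs.
CLUSTER dspam_token_data USING dspam_token_data_uid_token_key;

-- Refresh planner statistics after the rewrite.
ANALYZE dspam_token_data;
```

Whether this shortens the delay depends on whether the slow step really is the token lookup, which the log suggests but does not prove.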
Algorithm graham burton
Tokenizer chain

What do you suggest for the number of training ham/spam mails? Are 2K mails enough?

I trained dspam with the TEFT option. After the training I switched to TOE in dspam.conf.

I would like to reduce the database size (currently 600MB) without losing spam catch rate.

Here is the debug log. As you can see, there is a 22-second delay between the "pgsql query..." line and the first BNR pattern line. It seems dspam spends this time in the database query.

Tue Mar 29 15:15:25 2011 1112: [03/29/2011 15:15:25] Processing body token 'visit'
Tue Mar 29 15:15:25 2011 1112: [03/29/2011 15:15:25] Finished tokenizing (ngram) message
Tue Mar 29 15:15:25 2011 1112: [03/29/2011 15:15:25] pgsql query length: 11051
Tue Mar 29 15:15:25 2011
Tue Mar 29 15:15:47 2011 1112: [03/29/2011 15:15:47] BNR pattern instantiated: 'bnr.s|0.00_0.00_0.05_'
Tue Mar 29 15:15:47 2011 1112: [03/29/2011 15:15:47] BNR pattern instantiated: 'bnr.s|0.00_0.05_0.30_'
Tue Mar 29 15:15:47 2011 1112: [03/29/2011 15:15:47] BNR pattern instantiated: 'bnr.s|0.05_0.30_0.10_'

Tue Mar 29 15:23:32 2011 1112: [03/29/2011 15:23:32] Finished tokenizing (ngram) message
Tue Mar 29 15:23:32 2011 1112: [03/29/2011 15:23:32] pgsql query length: 11023
Tue Mar 29 15:23:32 2011
Tue Mar 29 15:23:41 2011 1112: [03/29/2011 15:23:41] BNR pattern instantiated: 'bnr.s|0.00_0.00_0.05_'
Tue Mar 29 15:23:41 2011 1112: [03/29/2011 15:23:41] BNR pattern instantiated: 'bnr.s|0.00_0.05_0.30_'

Tue Mar 29 15:35:08 2011 1112: [03/29/2011 15:35:08] Processing body token 'org"'
Tue Mar 29 15:35:08 2011 1112: [03/29/2011 15:35:08] Finished tokenizing (ngram) message
Tue Mar 29 15:35:08 2011 1112: [03/29/2011 15:35:08] pgsql query length: 28271
Tue Mar 29 15:35:08 2011
Tue Mar 29 15:35:48 2011 1112: [03/29/2011 15:35:48] BNR pattern instantiated: 'bnr.s|0.00_0.00_0.50_'
Tue Mar 29 15:35:48 2011 1112: [03/29/2011 15:35:48] BNR pattern instantiated: 'bnr.s|0.00_0.50_0.10_'
Tue Mar 29 15:35:48 2011 1112: [03/29/2011 15:35:48] BNR pattern instantiated: 'bnr.s|0.50_0.10_0.15_'

Thanks.

On Tue, Mar 29, 2011 at 4:28 PM, Kenneth Marshall <k...@rice.edu> wrote:
> On Tue, Mar 29, 2011 at 11:45:39AM +0300, Ibrahim Harrani wrote:
>> Hi,
>>
>> I am testing the git version of dspam with PostgreSQL 9.0 running on
>> FreeBSD 8 (dual core CPU, 4 GB memory).
>>
>> I trained dspam with 110K spam and 50K ham mails. Now I have more than
>> 7 million entries in dspam.
>>
>> dspam=# SELECT count(*) from dspam_token_data ;
>>   count
>> ---------
>>  7075311
>> (1 row)
>>
>> I vacuum and reindex the database regularly.
>>
>> When I start dspam, processing an email takes 40-50 sec at the
>> beginning, then drops to 10 sec.
>> If I run this test on a more powerful server (quad core CPU with 16GB
>> memory), it takes 0.01 secs.
>> I believe the problem on the small server is the large number of
>> database entries, but I would like to get better performance
>> on the small server as well. Any idea?
>>
>> Do you think that sqlite might be better than pgsql on this setup? Or
>> did I train dspam with too much spam/ham?
>>
>> Thanks.
>>
>
> Hi Ibrahim,
>
> Are these 7 million tokens for a single user? What tokenizer are you
> using: WORD, CHAIN, MARKOV/OSB, MARKOV/SBPH? That seems like an awful
> lot of training. The docs usually recommend 2k messages each of ham
> and spam. When we generated a base corpus for our user community,
> we pruned the resulting millions of tokens down to about 300k. Another
> thing that can help is to cluster your data on the uid+token index.
> It looks like you cannot keep the full active token pages in memory
> with only a 4GB system. Look at your paging/swapping stats. You may
> be able to reduce your memory footprint which should help your performance.
> Do you have your FILL FACTOR set to allow HOT updates?
>
> Cheers,
> Ken
>

_______________________________________________
Dspam-user mailing list
Dspam-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-user