Hi Stevan,

It is very nice to see you on the list after such a long time.

Sure, I trust you, and I can provide you with spam/ham mails. But how many mails do you need? :)

After running the following query, my database size dropped to 70MB:
DELETE FROM dspam_token_data WHERE innocent_hits < 10 AND spam_hits < 10

Now dspam processes each mail in less than one second. I also added many IgnoreHeader entries to dspam.conf from
http://sourceforge.net/apps/mediawiki/dspam/index.php?title=Working_DSPAM%2BPOSTFIX%2BMYSQL%2BCLAMAV_Setup_by_PaulC

PS: I think this training issue is a big problem for newcomers. We need a good document about training. Once I have learned it well, I plan to write one.

Thanks.

On Tue, Mar 29, 2011 at 10:35 PM, Stevan Bajić <ste...@bajic.ch> wrote:
> On Tue, 29 Mar 2011 17:24:28 +0300
> Ibrahim Harrani <ibrahim.harr...@gmail.com> wrote:
>
>> Hi Kenneth,
>>
> Hello Ibrahim,
>
>
>> Thanks for your prompt reply.
>> Yes, this is from a single user. But I am planning to use this user as a
>> global one that will be managed by admins.
>> I trained all spam with the same --username.
>> I changed fillfactor to 90 after the training, not at the beginning,
>> but this did not solve the problem.
>>
>> Algorithm graham burton
>> Tokenizer chain
>>
>> What do you suggest about the number of training ham/spam mails?
>> Are 2K mails enough? I trained dspam with the TEFT option. After the
>> training I switched to TOE in dspam.conf.
>> I would like to reduce the database size (currently 600MB) without losing
>> spam catch rate.
>>
> I don't know how open you are for suggestions? If you trust me then I would
> like to get hold of the data you used for the training. If you can compress
> the Spam/Ham and make it available for download, then I would like to offer
> to do the training for you. I would do the training with my own developed
> application that does the training differently than the stock DSPAM training
> application. The end result can be consumed with stock DSPAM. So after the
> whole training I would just export the data from PostgreSQL and compress it
> and make it available to you.
>
> I am confident that the different training method will result in much less
> data than the stock DSPAM training method while having at least equal catch rate
> (in my experience the catch rate will be better).
>
> Unfortunately I cannot release that training application because I have made
> some changes to stock DSPAM and that training application uses new
> functionality not available in stock DSPAM.
>
> Anyway... if you are open minded then let me know where I can download the
> training data and I will do the training. I promise that I will NOT use the
> data for anything other than the training. I don't think that the Spam part
> is sensitive but the Ham part sure is. But you have my word that I will not
> reuse that data or redistribute that data.
>
>
>>
>> Here is the debug log. As you see there is a 22 second delay between
>> the "pgsql query..." line and the BNR pattern.
>> It seems dspam spends this time in the database query.
>>
> Crazy. The query is just around 11K. That's nothing. And you run that on a
> 4GB system? This should be enough. DSPAM is not that memory hungry.
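To put numbers on the gaps in the debug log quoted in this thread, a short script can pair each "pgsql query length" line with the first "BNR pattern instantiated" line that follows it and report the difference in seconds. This is only a sketch under a few assumptions: the function name `query_to_bnr_delays` is made up for illustration, the sample lines are the wrapped log lines from this thread joined back together, and the leading timestamp is assumed to always look like `Tue Mar 29 15:15:25 2011` in an English locale.

```python
import re
from datetime import datetime

# Leading timestamp as it appears in the debug log, e.g. "Tue Mar 29 15:15:25 2011".
STAMP = re.compile(r"^(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4})")


def query_to_bnr_delays(lines):
    """Seconds between each 'pgsql query length' line and the next BNR line."""
    delays = []
    query_time = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # skip lines that do not start with a timestamp
        when = datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")
        if "pgsql query length" in line:
            query_time = when
        elif "BNR pattern instantiated" in line and query_time is not None:
            delays.append(int((when - query_time).total_seconds()))
            query_time = None  # only the first BNR line after a query counts
    return delays


# Log lines taken from this thread (wrapped lines joined).
SAMPLE = [
    "Tue Mar 29 15:15:25 2011 1112: [03/29/2011 15:15:25] pgsql query length: 11051",
    "Tue Mar 29 15:15:47 2011 1112: [03/29/2011 15:15:47] BNR pattern instantiated: 'bnr.s|0.00_0.00_0.05_'",
    "Tue Mar 29 15:23:32 2011 1112: [03/29/2011 15:23:32] pgsql query length: 11023",
    "Tue Mar 29 15:23:41 2011 1112: [03/29/2011 15:23:41] BNR pattern instantiated: 'bnr.s|0.00_0.00_0.05_'",
    "Tue Mar 29 15:35:08 2011 1112: [03/29/2011 15:35:08] pgsql query length: 28271",
    "Tue Mar 29 15:35:48 2011 1112: [03/29/2011 15:35:48] BNR pattern instantiated: 'bnr.s|0.00_0.00_0.50_'",
]

print(query_to_bnr_delays(SAMPLE))  # [22, 9, 40]
```

Run against a full debug log, the same function would show whether the delay shrinks as the database cache warms up.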
>
>
>> Tue Mar 29 15:15:25 2011 1112: [03/29/2011 15:15:25] Processing body
>> token 'visit'
>> Tue Mar 29 15:15:25 2011 1112: [03/29/2011 15:15:25] Finished
>> tokenizing (ngram) message
>> Tue Mar 29 15:15:25 2011 1112: [03/29/2011 15:15:25] pgsql query length:
>> 11051
>> Tue Mar 29 15:15:25 2011
>> Tue Mar 29 15:15:47 2011 1112: [03/29/2011 15:15:47] BNR pattern
>> instantiated: 'bnr.s|0.00_0.00_0.05_'
>> Tue Mar 29 15:15:47 2011 1112: [03/29/2011 15:15:47] BNR pattern
>> instantiated: 'bnr.s|0.00_0.05_0.30_'
>> Tue Mar 29 15:15:47 2011 1112: [03/29/2011 15:15:47] BNR pattern
>> instantiated: 'bnr.s|0.05_0.30_0.10_'
>>
>>
>> Tue Mar 29 15:23:32 2011 1112: [03/29/2011 15:23:32] Finished
>> tokenizing (ngram) message
>> Tue Mar 29 15:23:32 2011 1112: [03/29/2011 15:23:32] pgsql query length:
>> 11023
>> Tue Mar 29 15:23:32 2011
>> Tue Mar 29 15:23:41 2011 1112: [03/29/2011 15:23:41] BNR pattern
>> instantiated: 'bnr.s|0.00_0.00_0.05_'
>> Tue Mar 29 15:23:41 2011 1112: [03/29/2011 15:23:41] BNR pattern
>> instantiated: 'bnr.s|0.00_0.05_0.30_'
>>
>>
>>
>> Tue Mar 29 15:35:08 2011 1112: [03/29/2011 15:35:08] Processing body
>> token 'org"'
>> Tue Mar 29 15:35:08 2011 1112: [03/29/2011 15:35:08] Finished
>> tokenizing (ngram) message
>> Tue Mar 29 15:35:08 2011 1112: [03/29/2011 15:35:08] pgsql query length:
>> 28271
>> Tue Mar 29 15:35:08 2011
>> Tue Mar 29 15:35:48 2011 1112: [03/29/2011 15:35:48] BNR pattern
>> instantiated: 'bnr.s|0.00_0.00_0.50_'
>> Tue Mar 29 15:35:48 2011 1112: [03/29/2011 15:35:48] BNR pattern
>> instantiated: 'bnr.s|0.00_0.50_0.10_'
>> Tue Mar 29 15:35:48 2011 1112: [03/29/2011 15:35:48] BNR pattern
>> instantiated: 'bnr.s|0.50_0.10_0.15_'
>>
> Really strange. 40 seconds between query and BNR? This is way too much time.
>
> If you trust me regarding the Ham data then I would be very much interested
> to see how low I can go with the space usage and still maintain a high
> accuracy? After all you don't have anything to lose.
> And you could save your current data and then switch inside dspam.conf
> from one database instance to the other and see which one has better
> accuracy, or use your current dspam.conf and switch with the one I would
> provide you to use with the dataset I produced and then compare the result.
>
> Are you open minded for such a small experiment? Just let me know.
>
>
>
>> Thanks.
>>
> --
> Kind Regards from Switzerland,
>
> Stevan Bajić
>
>
>>
>> On Tue, Mar 29, 2011 at 4:28 PM, Kenneth Marshall <k...@rice.edu> wrote:
>> > On Tue, Mar 29, 2011 at 11:45:39AM +0300, Ibrahim Harrani wrote:
>> >> Hi,
>> >>
>> >> I am testing the git version of dspam with PostgreSQL 9.0 running on
>> >> FreeBSD 8 (Dual core cpu, 4 GB memory)
>> >>
>> >> I trained dspam with 110K spam and 50K ham mails. Now I have more than
>> >> 7 million entries in dspam.
>> >>
>> >> dspam=# SELECT count(*) from dspam_token_data ;
>> >>   count
>> >> ---------
>> >>  7075311
>> >> (1 row)
>> >>
>> >> I vacuum and reindex the database regularly.
>> >>
>> >> When I start dspam, processing an email takes 40-50 sec at the
>> >> beginning, then drops to 10 sec.
>> >> If I run this test on a more powerful server (quad core cpu with 16GB
>> >> memory), it takes 0.01 secs.
>> >> I believe that the problem with the small server is the large number of
>> >> database entries, but I would like to get better performance
>> >> on the small server as well. Any idea?
>> >>
>> >> Do you think that sqlite might be better than pgsql on this setup? Or
>> >> did I train dspam with too much spam/ham?
>> >>
>> >> Thanks.
>> >>
>> >
>> > Hi Ibrahim,
>> >
>> > Are these 7 million tokens for a single user? What tokenizer are you
>> > using: WORD, CHAIN, MARKOV/OSB, MARKOV/SBPH? That seems like an awful
>> > lot of training. The docs usually recommend 2k messages each of ham
>> > and spam. When we generated a base corpus for our user community,
>> > we pruned the resulting millions of tokens down to about 300k.
>> > Another thing that can help is to cluster your data on the uid+token index.
>> > It looks like you cannot keep the full active token pages in memory
>> > with only a 4GB system. Look at your paging/swapping stats. You may
>> > be able to reduce your memory footprint which should help your performance.
>> > Do you have your FILL FACTOR set to allow HOT updates?
>> >
>> > Cheers,
>> > Ken
>> >
>>
>> _______________________________________________
>> Dspam-user mailing list
>> Dspam-user@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/dspam-user
>>
>
> ------------------------------------------------------------------------------
> Enable your software for Intel(R) Active Management Technology to meet the
> growing manageability and security demands of your customers. Businesses
> are taking advantage of Intel(R) vPro (TM) technology - will your software
> be a part of the solution? Download the Intel(R) Manageability Checker
> today! http://p.sf.net/sfu/intel-dev2devmar
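Before running a pruning DELETE like the one at the top of this message, it may help to preview how many rows a given threshold would remove. The sketch below is not DSPAM code: it uses an in-memory SQLite table as a stand-in for the PostgreSQL `dspam_token_data` table, with a simplified set of columns assumed for illustration, a handful of made-up rows, and a hypothetical helper named `prune_preview`. Only the WHERE clause mirrors the query used in this thread.

```python
import sqlite3

# In-memory SQLite table standing in for dspam_token_data.
# The column set is simplified and assumed for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dspam_token_data ("
    "uid INTEGER, token INTEGER, spam_hits INTEGER, innocent_hits INTEGER)"
)
conn.executemany(
    "INSERT INTO dspam_token_data VALUES (?, ?, ?, ?)",
    [
        (1, 101, 0, 1),    # rarely seen token: would be pruned
        (1, 102, 9, 9),    # both counters below 10: would be pruned
        (1, 103, 25, 0),   # strong spam token: kept
        (1, 104, 3, 40),   # strong ham token: kept
        (1, 105, 10, 10),  # exactly at the threshold: kept
    ],
)


def prune_preview(conn, threshold):
    """Return (rows the DELETE would remove, rows it would keep)."""
    (total,) = conn.execute("SELECT count(*) FROM dspam_token_data").fetchone()
    (doomed,) = conn.execute(
        "SELECT count(*) FROM dspam_token_data "
        "WHERE innocent_hits < ? AND spam_hits < ?",
        (threshold, threshold),
    ).fetchone()
    return doomed, total - doomed


pruned, kept = prune_preview(conn, 10)
print(f"threshold 10: would delete {pruned}, keep {kept}")
```

Trying a few thresholds this way gives a rough idea of the size reduction before any data is actually deleted; the effect on accuracy still has to be checked separately.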