Hi Stevan,

What is the best method for training?

1. Train spam and ham mails separately with
   dspam --client --user x --class=ham/spam
2. Train with dspam_train, using the same number of ham and spam mails
Thanks.

On Tue, Mar 29, 2011 at 11:01 PM, Ibrahim Harrani <ibrahim.harr...@gmail.com> wrote:
> Hi Stevan,
>
> It is very nice to see you on the list after such a long time.
> Sure, I trust you and I can provide you with spam/ham mails. But how many
> mails do you need? :)
> After running the following query, my database size shrank to 70MB.
>
> DELETE FROM dspam_token_data WHERE innocent_hits < 10 AND spam_hits < 10
>
> Now dspam processes a mail in less than one second.
> I also added many IgnoreHeader entries to dspam.conf from
> http://sourceforge.net/apps/mediawiki/dspam/index.php?title=Working_DSPAM%2BPOSTFIX%2BMYSQL%2BCLAMAV_Setup_by_PaulC
>
> PS: I think this training issue is a big problem for newcomers. We need
> a good document about training.
> Once I have learned it well, I plan to write one.
> Thanks.
>
>
> On Tue, Mar 29, 2011 at 10:35 PM, Stevan Bajić <ste...@bajic.ch> wrote:
>> On Tue, 29 Mar 2011 17:24:28 +0300
>> Ibrahim Harrani <ibrahim.harr...@gmail.com> wrote:
>>
>>> Hi Kenneth,
>>>
>> Hello Ibrahim,
>>
>>
>>> Thanks for your prompt reply.
>>> Yes, this is from a single user. But I am planning to use this user as a
>>> global user that will be managed by admins.
>>> I trained all spam with the same --username.
>>> I changed the fillfactor to 90 after the training, not at the beginning,
>>> but this did not solve the problem.
>>>
>>> Algorithm graham burton
>>> Tokenizer chain
>>>
>>> What do you suggest for the number of training ham/spam mails?
>>> Are 2K mails enough? I trained dspam with the TEFT option. After the
>>> training I switched to TOE in dspam.conf.
>>> I would like to reduce the database size (currently 600MB) without losing
>>> spam catch rate.
>>>
>> I don't know how open you are to suggestions. If you trust me, then I would
>> like to get hold of the data you used for the training. If you can compress
>> the spam/ham and make it available for download, then I would like to offer
>> to do the training for you.
>> I would do the training with an application I developed myself, which
>> trains differently than the stock DSPAM training application. The end
>> result can be consumed by stock DSPAM. So after the whole training I
>> would just export the data from PostgreSQL, compress it, and make it
>> available to you.
>>
>> I am confident that the different training method will result in much less
>> data than the stock DSPAM training method while having at least an equal
>> catch rate (in my experience the catch rate will be better).
>>
>> Unfortunately I cannot release that training application because I have
>> made some changes to stock DSPAM and the training application uses new
>> functionality not available in stock DSPAM.
>>
>> Anyway... if you are open-minded, then let me know where I can download the
>> training data and I will do the training. I promise that I will NOT use the
>> data for anything other than the training. I don't think that the spam part
>> is sensitive, but the ham part sure is. But you have my word that I will not
>> reuse or redistribute that data.
>>
>>
>>>
>>> Here is the debug log. As you can see, there is a 22-second delay between
>>> the "pgsql query..." line and the BNR pattern.
>>> It seems dspam spends this time in the database query.
>>>
>> Crazy. The query is just around 11K. That's nothing. And you run that on a
>> 4GB system? This should be enough. DSPAM is not that memory hungry.
>>
>>
>>> Tue Mar 29 15:15:25 2011 1112: [03/29/2011 15:15:25] Processing body token 'visit'
>>> Tue Mar 29 15:15:25 2011 1112: [03/29/2011 15:15:25] Finished tokenizing (ngram) message
>>> Tue Mar 29 15:15:25 2011 1112: [03/29/2011 15:15:25] pgsql query length: 11051
>>> Tue Mar 29 15:15:25 2011
>>> Tue Mar 29 15:15:47 2011 1112: [03/29/2011 15:15:47] BNR pattern instantiated: 'bnr.s|0.00_0.00_0.05_'
>>> Tue Mar 29 15:15:47 2011 1112: [03/29/2011 15:15:47] BNR pattern instantiated: 'bnr.s|0.00_0.05_0.30_'
>>> Tue Mar 29 15:15:47 2011 1112: [03/29/2011 15:15:47] BNR pattern instantiated: 'bnr.s|0.05_0.30_0.10_'
>>>
>>>
>>> Tue Mar 29 15:23:32 2011 1112: [03/29/2011 15:23:32] Finished tokenizing (ngram) message
>>> Tue Mar 29 15:23:32 2011 1112: [03/29/2011 15:23:32] pgsql query length: 11023
>>> Tue Mar 29 15:23:32 2011
>>> Tue Mar 29 15:23:41 2011 1112: [03/29/2011 15:23:41] BNR pattern instantiated: 'bnr.s|0.00_0.00_0.05_'
>>> Tue Mar 29 15:23:41 2011 1112: [03/29/2011 15:23:41] BNR pattern instantiated: 'bnr.s|0.00_0.05_0.30_'
>>>
>>>
>>>
>>> Tue Mar 29 15:35:08 2011 1112: [03/29/2011 15:35:08] Processing body token 'org"'
>>> Tue Mar 29 15:35:08 2011 1112: [03/29/2011 15:35:08] Finished tokenizing (ngram) message
>>> Tue Mar 29 15:35:08 2011 1112: [03/29/2011 15:35:08] pgsql query length: 28271
>>> Tue Mar 29 15:35:08 2011
>>> Tue Mar 29 15:35:48 2011 1112: [03/29/2011 15:35:48] BNR pattern instantiated: 'bnr.s|0.00_0.00_0.50_'
>>> Tue Mar 29 15:35:48 2011 1112: [03/29/2011 15:35:48] BNR pattern instantiated: 'bnr.s|0.00_0.50_0.10_'
>>> Tue Mar 29 15:35:48 2011 1112: [03/29/2011 15:35:48] BNR pattern instantiated: 'bnr.s|0.50_0.10_0.15_'
>>>
>> Really strange. 40 seconds between query and BNR? This is way too much time.
>>
>> If you trust me regarding the ham data, then I would be very interested
>> to see how low I can go with the space usage and still maintain a high
>> accuracy.
>> After all, you don't have anything to lose. You could save
>> your current data and then switch inside dspam.conf from one database
>> instance to the other and see which one has better accuracy, or swap your
>> current dspam.conf for the one I would provide for use with
>> the dataset I produced, and then compare the results.
>>
>> Are you open to such a small experiment? Just let me know.
>>
>>
>>
>>> Thanks.
>>>
>> --
>> Kind Regards from Switzerland,
>>
>> Stevan Bajić
>>
>>
>>>
>>> On Tue, Mar 29, 2011 at 4:28 PM, Kenneth Marshall <k...@rice.edu> wrote:
>>> > On Tue, Mar 29, 2011 at 11:45:39AM +0300, Ibrahim Harrani wrote:
>>> >> Hi,
>>> >>
>>> >> I am testing the git version of dspam with PostgreSQL 9.0 running on
>>> >> FreeBSD 8 (dual-core CPU, 4 GB memory).
>>> >>
>>> >> I trained dspam with 110K spam and 50K ham mails. Now I have more than
>>> >> 7 million entries in dspam.
>>> >>
>>> >> dspam=# SELECT count(*) from dspam_token_data ;
>>> >>   count
>>> >> ---------
>>> >>  7075311
>>> >> (1 row)
>>> >>
>>> >> I vacuum and reindex the database regularly.
>>> >>
>>> >> When I start dspam, processing an email takes 40-50 seconds at the
>>> >> beginning, then drops to 10 seconds.
>>> >> If I run this test on a more powerful server (quad-core CPU with 16GB
>>> >> memory), it takes 0.01 seconds.
>>> >> I believe the problem on the small server is the large number of database
>>> >> entries, but I would like to get better performance
>>> >> on the small server as well. Any ideas?
>>> >>
>>> >> Do you think that sqlite might be better than pgsql on this setup? Or
>>> >> did I train dspam with too much spam/ham?
>>> >>
>>> >> Thanks.
>>> >>
>>> >
>>> > Hi Ibrahim,
>>> >
>>> > Are these 7 million tokens for a single user? What tokenizer are you
>>> > using: WORD, CHAIN, MARKOV/OSB, MARKOV/SBPH? That seems like an awful
>>> > lot of training. The docs usually recommend 2k messages each of ham
>>> > and spam.
>>> > When we generated a base corpus for our user community,
>>> > we pruned the resulting millions of tokens down to about 300k. Another
>>> > thing that can help is to cluster your data on the uid+token index.
>>> > It looks like you cannot keep the full active token pages in memory
>>> > with only a 4GB system. Look at your paging/swapping stats. You may
>>> > be able to reduce your memory footprint, which should help your
>>> > performance.
>>> > Do you have your FILL FACTOR set to allow HOT updates?
>>> >
>>> > Cheers,
>>> > Ken
>>> >
>>>
>>> _______________________________________________
>>> Dspam-user mailing list
>>> Dspam-user@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/dspam-user
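
For anyone who wants to measure the delays discussed in the quoted debug log (the gap between a "pgsql query length" entry and the next stamped entry), here is a small sketch in Python. It is not part of DSPAM. The function name query_delays is made up for illustration, and the script assumes every relevant entry carries a [MM/DD/YYYY HH:MM:SS] stamp like the ones in the excerpt quoted in this thread.

```python
import re
from datetime import datetime

# Matches the bracketed stamp and the message text that follows it,
# e.g. "[03/29/2011 15:15:25] pgsql query length: 11051".
STAMP = re.compile(r"\[(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\]\s+(.*)")


def query_delays(lines):
    """Return the seconds between each 'pgsql query length' entry
    and the next stamped entry that follows it."""
    delays = []
    pending = None  # time of the last query entry still waiting for a follow-up
    for line in lines:
        match = STAMP.search(line)
        if not match:
            continue  # skip entries without a bracketed stamp
        when = datetime.strptime(match.group(1), "%m/%d/%Y %H:%M:%S")
        text = match.group(2)
        if pending is not None:
            delays.append((when - pending).total_seconds())
            pending = None
        if text.startswith("pgsql query length"):
            pending = when
    return delays


# Entries copied from the debug log quoted in this thread.
SAMPLE = """\
Tue Mar 29 15:15:25 2011 1112: [03/29/2011 15:15:25] pgsql query length: 11051
Tue Mar 29 15:15:25 2011
Tue Mar 29 15:15:47 2011 1112: [03/29/2011 15:15:47] BNR pattern instantiated: 'bnr.s|0.00_0.00_0.05_'
Tue Mar 29 15:23:32 2011 1112: [03/29/2011 15:23:32] pgsql query length: 11023
Tue Mar 29 15:23:41 2011 1112: [03/29/2011 15:23:41] BNR pattern instantiated: 'bnr.s|0.00_0.00_0.05_'
Tue Mar 29 15:35:08 2011 1112: [03/29/2011 15:35:08] pgsql query length: 28271
Tue Mar 29 15:35:48 2011 1112: [03/29/2011 15:35:48] BNR pattern instantiated: 'bnr.s|0.00_0.00_0.50_'
"""

if __name__ == "__main__":
    print(query_delays(SAMPLE.splitlines()))  # [22.0, 9.0, 40.0]
```

Running the same function over a log taken after pruning or clustering the token table should show whether the time spent in the query actually went down.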