On Tue, 29 Mar 2011 23:06:50 +0300
Ibrahim Harrani <ibrahim.harr...@gmail.com> wrote:

> Hi Stevan,
>
> What is the best method for the training?
> 1. Training spam/ham mails separately with dspam --client --user x --class=ham/spam
> 2. Train with dspam_train with the same number of ham/spam

Better to use method 2, since dspam_train will try to balance the spam/ham ratio during training.
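In practice the two approaches look roughly like this. This is only a sketch: the user name and corpus paths are made up, stock dspam normally calls the non-spam class "innocent" rather than "ham", and the dspam_train argument order (user, spam directory, ham directory) is assumed:

    # Method 1: feed single messages yourself, one class at a time
    dspam --user shareduser --class=spam --source=corpus < corpus/spam/0001.eml
    dspam --user shareduser --class=innocent --source=corpus < corpus/ham/0001.eml

    # Method 2: let dspam_train walk two directories and alternate between
    # them, so the spam/ham ratio stays balanced while it trains
    dspam_train shareduser corpus/spam corpus/ham

Whichever way you train, keep the two corpora roughly the same size, which is exactly what dspam_train tries to enforce for you.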
> Thanks.
>
> On Tue, Mar 29, 2011 at 11:01 PM, Ibrahim Harrani <ibrahim.harr...@gmail.com> wrote:
> > Hi Stevan,
> >
> > It is very nice to see you on the list after such a long time.
> > Sure, I trust you and I can provide you spam/ham mails. But how many mails do you need? :)
> > After running the following query my database size became 70MB.
> >
> > DELETE FROM dspam_token_data WHERE innocent_hits < 10 AND spam_hits < 10
> >
> > Now dspam processes the mail in less than one second.
> > I also added many IgnoreHeader entries to dspam.conf from
> > http://sourceforge.net/apps/mediawiki/dspam/index.php?title=Working_DSPAM%2BPOSTFIX%2BMYSQL%2BCLAMAV_Setup_by_PaulC
> >
> > PS: I think this training issue is a big problem for newcomers. We need a good document about the training.
> > Once I learn it well, I am planning to write a document.
> > Thanks.
> >
> >
> > On Tue, Mar 29, 2011 at 10:35 PM, Stevan Bajić <ste...@bajic.ch> wrote:
> >> On Tue, 29 Mar 2011 17:24:28 +0300
> >> Ibrahim Harrani <ibrahim.harr...@gmail.com> wrote:
> >>
> >>> Hi Kenneth,
> >>>
> >> Hello Ibrahim,
> >>
> >>
> >>> Thanks for your prompt reply.
> >>> Yes, this is from a single user. But I am planning to use this user as a global user that will be managed by admins.
> >>> I trained all spam with the same --username.
> >>> I changed fillfactor to 90 after the training, not at the beginning, but this did not solve the problem.
> >>>
> >>> Algorithm graham burton
> >>> Tokenizer chain
> >>>
> >>> What do you suggest for the number of training ham/spam mails? Are 2K mails enough? I trained dspam with the TEFT option. After the training I switched to TOE in dspam.conf.
> >>> I would like to reduce the database size (currently 600MB) without losing spam catch rate.
> >>>
> >> I don't know how open you are to suggestions. If you trust me, then I would like to get hold of the data you used for the training. If you can compress the spam/ham and make it available for download, then I would like to offer to do the training for you. I would do the training with my own application, which does the training differently than the stock DSPAM training application. The end result can be consumed with stock DSPAM. So after the whole training I would just export the data from PostgreSQL, compress it and make it available to you.
> >>
> >> I am confident that the different training method will result in much less data than the stock DSPAM training method while having at least an equal catch rate (in my experience the catch rate will be better).
> >>
> >> Unfortunately I cannot release that training application, because I have made some changes to stock DSPAM and the training application uses new functionality not available in stock DSPAM.
> >>
> >> Anyway... if you are open minded, then let me know where I can download the training data and I will do the training. I promise that I will NOT use the data for anything other than the training. I don't think that the spam part is sensitive, but the ham part sure is. But you have my word that I will not reuse or redistribute that data.
> >>
> >>
> >>> Here is the debug log. As you see there is a 22 second delay between the "pgsql query..." line and the BNR pattern.
> >>> It seems dspam spends that time in the database query.
> >>>
> >> Crazy. The query is just around 11K. That's nothing. And you run that on a 4GB system? This should be enough. DSPAM is not that memory hungry.
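One way to confirm that the time really goes into that lookup is to let PostgreSQL log slow statements and then look at the plan of a cut-down token query by hand. This is only a sketch: the column names are taken from the DELETE statement and the uid+token index mentioned elsewhere in this thread, and the real statement dspam sends lists thousands of tokens in the IN clause:

    # in postgresql.conf, log anything slower than half a second, then reload:
    #   log_min_duration_statement = 500

    # inspect the plan of a trimmed-down token lookup
    psql -d dspam -c "EXPLAIN ANALYZE
        SELECT token, spam_hits, innocent_hits
          FROM dspam_token_data
         WHERE uid = 1
           AND token IN (1234567890, 987654321);"

If the plan shows a sequential scan over the 7-million-row table instead of an index scan on (uid, token), that alone would explain delays of tens of seconds on a machine that cannot keep the table in memory.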
> >>> Tue Mar 29 15:15:25 2011 1112: [03/29/2011 15:15:25] Processing body token 'visit'
> >>> Tue Mar 29 15:15:25 2011 1112: [03/29/2011 15:15:25] Finished tokenizing (ngram) message
> >>> Tue Mar 29 15:15:25 2011 1112: [03/29/2011 15:15:25] pgsql query length: 11051
> >>> Tue Mar 29 15:15:25 2011
> >>> Tue Mar 29 15:15:47 2011 1112: [03/29/2011 15:15:47] BNR pattern instantiated: 'bnr.s|0.00_0.00_0.05_'
> >>> Tue Mar 29 15:15:47 2011 1112: [03/29/2011 15:15:47] BNR pattern instantiated: 'bnr.s|0.00_0.05_0.30_'
> >>> Tue Mar 29 15:15:47 2011 1112: [03/29/2011 15:15:47] BNR pattern instantiated: 'bnr.s|0.05_0.30_0.10_'
> >>>
> >>>
> >>> Tue Mar 29 15:23:32 2011 1112: [03/29/2011 15:23:32] Finished tokenizing (ngram) message
> >>> Tue Mar 29 15:23:32 2011 1112: [03/29/2011 15:23:32] pgsql query length: 11023
> >>> Tue Mar 29 15:23:32 2011
> >>> Tue Mar 29 15:23:41 2011 1112: [03/29/2011 15:23:41] BNR pattern instantiated: 'bnr.s|0.00_0.00_0.05_'
> >>> Tue Mar 29 15:23:41 2011 1112: [03/29/2011 15:23:41] BNR pattern instantiated: 'bnr.s|0.00_0.05_0.30_'
> >>>
> >>>
> >>> Tue Mar 29 15:35:08 2011 1112: [03/29/2011 15:35:08] Processing body token 'org"'
> >>> Tue Mar 29 15:35:08 2011 1112: [03/29/2011 15:35:08] Finished tokenizing (ngram) message
> >>> Tue Mar 29 15:35:08 2011 1112: [03/29/2011 15:35:08] pgsql query length: 28271
> >>> Tue Mar 29 15:35:08 2011
> >>> Tue Mar 29 15:35:48 2011 1112: [03/29/2011 15:35:48] BNR pattern instantiated: 'bnr.s|0.00_0.00_0.50_'
> >>> Tue Mar 29 15:35:48 2011 1112: [03/29/2011 15:35:48] BNR pattern instantiated: 'bnr.s|0.00_0.50_0.10_'
> >>> Tue Mar 29 15:35:48 2011 1112: [03/29/2011 15:35:48] BNR pattern instantiated: 'bnr.s|0.50_0.10_0.15_'
> >>>
> >> Really strange. 40 seconds between query and BNR? This is way too much time.
> >>
> >> If you trust me regarding the ham data, then I would be very interested to see how low I can go with the space usage while still maintaining a high accuracy. After all, you don't have anything to lose. You could save your current data and then switch inside dspam.conf from one database instance to the other and see which one has better accuracy, or keep your current dspam.conf, swap in the one I would provide for the dataset I produce, and then compare the results.
> >>
> >> Are you open-minded about such a small experiment? Just let me know.
> >>
> >>
> >>> Thanks.
> >>>
> >> --
> >> Kind Regards from Switzerland,
> >>
> >> Stevan Bajić
> >>
> >>
> >>> On Tue, Mar 29, 2011 at 4:28 PM, Kenneth Marshall <k...@rice.edu> wrote:
> >>> > On Tue, Mar 29, 2011 at 11:45:39AM +0300, Ibrahim Harrani wrote:
> >>> >> Hi,
> >>> >>
> >>> >> I am testing the git version of dspam with PostgreSQL 9.0 running on FreeBSD 8 (dual-core CPU, 4 GB memory).
> >>> >>
> >>> >> I trained dspam with 110K spam and 50K ham mails. Now I have more than 7 million entries in dspam.
> >>> >>
> >>> >> dspam=# SELECT count(*) from dspam_token_data ;
> >>> >>   count
> >>> >> ---------
> >>> >>  7075311
> >>> >> (1 row)
> >>> >>
> >>> >> I vacuum and reindex the database regularly.
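For reference, that routine maintenance, plus a quick check of how much space the token table and its indexes actually take, might look like this (a sketch; the database and table names are taken from the psql session quoted above):

    # routine maintenance on the token table
    psql -d dspam -c "VACUUM ANALYZE dspam_token_data;"
    psql -d dspam -c "REINDEX TABLE dspam_token_data;"

    # current size of the table including its indexes
    psql -d dspam -c "SELECT pg_size_pretty(pg_total_relation_size('dspam_token_data'));"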
> >>> >>
> >>> >> When I start dspam, processing an email takes 40-50 seconds at the beginning, then drops to 10 seconds.
> >>> >> If I run the same test on a more powerful server (quad-core CPU with 16GB memory), it takes 0.01 seconds.
> >>> >> I believe the problem on the small server is the large number of database entries, but I would like to get better performance on the small server as well. Any idea?
> >>> >>
> >>> >> Do you think that SQLite might be better than PostgreSQL on this setup? Or did I train dspam with too much spam/ham?
> >>> >>
> >>> >> Thanks.
> >>> >>
> >>> >
> >>> > Hi Ibrahim,
> >>> >
> >>> > Are these 7 million tokens for a single user? What tokenizer are you using: WORD, CHAIN, MARKOV/OSB, MARKOV/SBPH? That seems like an awful lot of training. The docs usually recommend 2k messages each of ham and spam. When we generated a base corpus for our user community, we pruned the resulting millions of tokens down to about 300k. Another thing that can help is to cluster your data on the uid+token index. It looks like you cannot keep the full active token pages in memory with only a 4GB system. Look at your paging/swapping stats. You may be able to reduce your memory footprint, which should help your performance.
> >>> > Do you have your FILL FACTOR set to allow HOT updates?
> >>> >
> >>> > Cheers,
> >>> > Ken
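Kenneth's last two suggestions could be tried along these lines. This is only a sketch: the index name is an assumption (check \d dspam_token_data for the real one), and a lowered fillfactor only applies to pages written after the change, so the CLUSTER rewrite is what actually spreads the existing rows out:

    # leave ~10% free space per page so updates can stay on the same page (HOT)
    psql -d dspam -c "ALTER TABLE dspam_token_data SET (fillfactor = 90);"

    # rewrite the table in uid+token order; the index name here is assumed
    psql -d dspam -c "CLUSTER dspam_token_data USING dspam_token_data_uid_token_idx;"
    psql -d dspam -c "ANALYZE dspam_token_data;"

    # FreeBSD: watch for paging/swapping while dspam processes a message
    vmstat 5
    swapinfo -h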