On Tue, 29 Mar 2011 23:06:50 +0300
Ibrahim Harrani <ibrahim.harr...@gmail.com> wrote:

> Hi Stevan,
> 
> What is the best method for the training?
> 1. Training spam/ham mails separately with dspam --client --user x
> --class=ham/spam
> 2. train with dspam_train with the same number of ham/spam
> 
Better to use method 2, since dspam_train will try to balance the Spam/Ham
ratio during training.


> Thanks.
> On Tue, Mar 29, 2011 at 11:01 PM, Ibrahim Harrani
> <ibrahim.harr...@gmail.com> wrote:
> > Hi Stevan,
> >
> > It is very nice to see you on the list after such a long time.
> > Sure, I trust you, and I can provide you with spam/ham mails. But how
> > many mails do you need? :)
> > After running the following query my database size became 70MB.
> >
> > DELETE FROM dspam_token_data   WHERE innocent_hits < 10 AND spam_hits < 10
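> > (After a delete of that size, the files only shrink once the table is
> > rewritten; on PostgreSQL 9.0 a VACUUM FULL rewrites the table and its
> > indexes. A minimal sketch, database name assumed to be "dspam":)
> >
> >     psql -d dspam -c 'VACUUM FULL ANALYZE dspam_token_data;'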
> >
> > Now dspam processes a mail in less than one second.
> > I also added many IgnoreHeader entries to dspam.conf from
> > http://sourceforge.net/apps/mediawiki/dspam/index.php?title=Working_DSPAM%2BPOSTFIX%2BMYSQL%2BCLAMAV_Setup_by_PaulC
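> > For example (the particular header names here are just illustrations;
> > the page above lists a fuller set):
> >
> >     IgnoreHeader X-Spam-Status
> >     IgnoreHeader X-Virus-Scanned
> >     IgnoreHeader Received-SPF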
> >
> > PS: I think this training issue is a big problem for newcomers. We need
> > a good document about training.
> > Once I have learned it well, I am planning to write one.
> > Thanks.
> >
> >
> > On Tue, Mar 29, 2011 at 10:35 PM, Stevan Bajić <ste...@bajic.ch> wrote:
> >> On Tue, 29 Mar 2011 17:24:28 +0300
> >> Ibrahim Harrani <ibrahim.harr...@gmail.com> wrote:
> >>
> >>> Hi Kenneth,
> >>>
> >> Hello Ibrahim,
> >>
> >>
> >>> Thanks for your prompt reply.
> >>> Yes, this is from a single user, but I am planning to use this user
> >>> as a global user that will be managed by admins.
> >>> I trained all the spam with the same --user.
> >>> I changed the fillfactor to 90 after the training, not at the
> >>> beginning, but this did not solve the problem.
> >>>
> >>> Algorithm graham burton
> >>> Tokenizer chain
> >>>
> >>> What do you suggest for the number of training ham/spam mails?
> >>> Are 2K mails enough? I trained dspam with the TEFT option; after the
> >>> training I switched to TOE in dspam.conf.
> >>> I would like to reduce the database size (currently 600MB) without
> >>> losing the spam catch rate.
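> >>> The switch itself is one line in dspam.conf (sketch; with TOE, dspam
> >>> only retrains on errors, so misclassified mail still has to be fed
> >>> back):
> >>>
> >>>     # initial corpus training:
> >>>     TrainingMode teft
> >>>     # after training, retrain on errors only:
> >>>     TrainingMode toe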
> >>>
> >> I don't know how open you are to suggestions. If you trust me, then I
> >> would like to get hold of the data you used for the training. If you can
> >> compress the Spam/Ham and make it available for download, then I would
> >> like to offer to do the training for you. I would do the training with
> >> an application I developed myself, which does the training differently
> >> than the stock DSPAM training application. The end result can be
> >> consumed by stock DSPAM. So after the whole training I would just export
> >> the data from PostgreSQL, compress it, and make it available to you.
> >>
> >> I am confident that the different training method will result in much
> >> less data than the stock DSPAM training method, while having at least an
> >> equal catch rate (in my experience the catch rate will be better).
> >>
> >> Unfortunately I cannot release that training application, because I
> >> have made some changes to stock DSPAM and the training application uses
> >> new functionality not available in stock DSPAM.
> >>
> >> Anyway... if you are open minded, then let me know where I can download
> >> the training data and I will do the training. I promise that I will NOT
> >> use the data for anything other than the training. I don't think that
> >> the Spam part is sensitive, but the Ham part sure is. You have my word
> >> that I will not reuse or redistribute that data.
> >>
> >>
> >>>
> >>> Here is the debug log. As you can see, there is a 22-second delay
> >>> between the "pgsql query..." line and the BNR pattern.
> >>> It seems dspam spends that time in the database query.
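> >>> One way to confirm that the query is the culprit would be PostgreSQL's
> >>> slow-query log (sketch; postgresql.conf, takes effect after a reload):
> >>>
> >>>     # log every statement that runs longer than one second
> >>>     log_min_duration_statement = 1000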
> >>>
> >> Crazy. The query is just around 11K; that's nothing. And you run that
> >> on a 4GB system? That should be enough. DSPAM is not that memory hungry.
> >>
> >>
> >>> Tue Mar 29 15:15:25 2011  1112: [03/29/2011 15:15:25] Processing body
> >>> token 'visit'
> >>> Tue Mar 29 15:15:25 2011  1112: [03/29/2011 15:15:25] Finished
> >>> tokenizing (ngram) message
> >>> Tue Mar 29 15:15:25 2011  1112: [03/29/2011 15:15:25] pgsql query length: 
> >>> 11051
> >>> Tue Mar 29 15:15:25 2011
> >>> Tue Mar 29 15:15:47 2011  1112: [03/29/2011 15:15:47] BNR pattern
> >>> instantiated: 'bnr.s|0.00_0.00_0.05_'
> >>> Tue Mar 29 15:15:47 2011  1112: [03/29/2011 15:15:47] BNR pattern
> >>> instantiated: 'bnr.s|0.00_0.05_0.30_'
> >>> Tue Mar 29 15:15:47 2011  1112: [03/29/2011 15:15:47] BNR pattern
> >>> instantiated: 'bnr.s|0.05_0.30_0.10_'
> >>>
> >>>
> >>> Tue Mar 29 15:23:32 2011  1112: [03/29/2011 15:23:32] Finished
> >>> tokenizing (ngram) message
> >>> Tue Mar 29 15:23:32 2011  1112: [03/29/2011 15:23:32] pgsql query length: 
> >>> 11023
> >>> Tue Mar 29 15:23:32 2011
> >>> Tue Mar 29 15:23:41 2011  1112: [03/29/2011 15:23:41] BNR pattern
> >>> instantiated: 'bnr.s|0.00_0.00_0.05_'
> >>> Tue Mar 29 15:23:41 2011  1112: [03/29/2011 15:23:41] BNR pattern
> >>> instantiated: 'bnr.s|0.00_0.05_0.30_'
> >>>
> >>>
> >>>
> >>> Tue Mar 29 15:35:08 2011  1112: [03/29/2011 15:35:08] Processing body
> >>> token 'org"'
> >>> Tue Mar 29 15:35:08 2011  1112: [03/29/2011 15:35:08] Finished
> >>> tokenizing (ngram) message
> >>> Tue Mar 29 15:35:08 2011  1112: [03/29/2011 15:35:08] pgsql query length: 
> >>> 28271
> >>> Tue Mar 29 15:35:08 2011
> >>> Tue Mar 29 15:35:48 2011  1112: [03/29/2011 15:35:48] BNR pattern
> >>> instantiated: 'bnr.s|0.00_0.00_0.50_'
> >>> Tue Mar 29 15:35:48 2011  1112: [03/29/2011 15:35:48] BNR pattern
> >>> instantiated: 'bnr.s|0.00_0.50_0.10_'
> >>> Tue Mar 29 15:35:48 2011  1112: [03/29/2011 15:35:48] BNR pattern
> >>> instantiated: 'bnr.s|0.50_0.10_0.15_'
> >>>
> >> Really strange. 40 seconds between the query and BNR? That is way too
> >> much time.
> >>
> >> If you trust me regarding the Ham data, then I would be very interested
> >> to see how low I can go with the space usage while still maintaining
> >> high accuracy. After all, you don't have anything to lose. You could
> >> save your current data and then switch inside dspam.conf from one
> >> database instance to the other and see which one has the better
> >> accuracy, or keep your current dspam.conf, swap in the one I would
> >> provide for use with the dataset I produced, and then compare the
> >> results; see the sketch below.
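> >> A sketch of that switch (directive name from the stock dspam.conf
> >> PostgreSQL section; the second database name is just a placeholder):
> >>
> >>     # current corpus:
> >>     PgSQLDb dspam
> >>     # retrained corpus; point here to compare:
> >>     #PgSQLDb dspam_retrained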
> >>
> >> Are you open to such a small experiment? Just let me know.
> >>
> >>
> >>
> >>> Thanks.
> >>>
> >> --
> >> Kind Regards from Switzerland,
> >>
> >> Stevan Bajić
> >>
> >>
> >>>
> >>> On Tue, Mar 29, 2011 at 4:28 PM, Kenneth Marshall <k...@rice.edu> wrote:
> >>> > On Tue, Mar 29, 2011 at 11:45:39AM +0300, Ibrahim Harrani wrote:
> >>> >> Hi,
> >>> >>
> >>> >> I am testing git version of dspam with PostgreSQL 9.0 running on
> >>> >> FreeBSD 8 (Dual core cpu, 4 GB memory)
> >>> >>
> >>> >> I trained dspam with 110K spam and 50K ham mails. Now I have more
> >>> >> than 7 million entries in dspam.
> >>> >>
> >>> >> dspam=# SELECT count(*) from dspam_token_data ;
> >>> >>   count
> >>> >> ---------
> >>> >>  7075311
> >>> >> (1 row)
> >>> >>
> >>> >> I vacuum and reindex database regularly.
> >>> >>
> >>> >> When I start dspam, processing an email takes 40-50 seconds at the
> >>> >> beginning, then drops to 10 seconds.
> >>> >> If I run the same test on a more powerful server (quad-core CPU
> >>> >> with 16GB memory), it takes 0.01 seconds.
> >>> >> I believe the problem on the small server is the large number of
> >>> >> database entries, but I would like to get better performance
> >>> >> on the small server as well. Any idea?
> >>> >>
> >>> >> Do you think that sqlite might be better than pgsql for this setup?
> >>> >> Or did I train dspam with too much spam/ham?
> >>> >>
> >>> >> Thanks.
> >>> >>
> >>> >
> >>> > Hi Ibrahim,
> >>> >
> >>> > Are these 7 million tokens for a single user? What tokenizer are you
> >>> > using: WORD, CHAIN, MARKOV/OSB, MARKOV/SBPH? That seems like an awful
> >>> > lot of training. The docs usually recommend 2k messages each of ham
> >>> > and spam. When we generated a base corpus for our user community,
> >>> > we pruned the resulting millions of tokens down to about 300k. Another
> >>> > thing that can help is to cluster your data on the uid+token index.
> >>> > It looks like you cannot keep the full active token pages in memory
> >>> > with only a 4GB system. Look at your paging/swapping stats. You may
> >>> > be able to reduce your memory footprint, which should help your
> >>> > performance.
> >>> > Do you have your FILL FACTOR set to allow HOT updates?
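> >>> > Rough sketches of the clustering and fillfactor suggestions (the
> >>> > index name is a guess at the stock uid+token unique index, verify
> >>> > with \d dspam_token_data; 90 is just an example value):
> >>> >
> >>> >     # physically order the table by the uid+token lookup index
> >>> >     psql -d dspam -c 'CLUSTER dspam_token_data USING id_token_data_01;'
> >>> >     # leave slack in each page so updates can stay HOT
> >>> >     # (only newly written pages are affected until the next rewrite)
> >>> >     psql -d dspam -c 'ALTER TABLE dspam_token_data SET (fillfactor = 90);'
> >>> >     # watch paging/swapping while dspam runs
> >>> >     vmstat 5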
> >>> >
> >>> > Cheers,
> >>> > Ken
> >>> >
> >>>
> >>
> >
> 
