On Tue, 29 Mar 2011 17:24:28 +0300
Ibrahim Harrani <ibrahim.harr...@gmail.com> wrote:

> Hi Kenneth,
> 
Hello Ibrahim,


> Thanks for your prompt reply.
> Yes, this is from a single user. But I am planning to use this user as a
> global user that will be managed by admins.
> I trained all spam with the same --username.
> I changed fillfactor to 90 after the training, not at the beginning,
> but this did not solve the problem.
> 
> Algorithm graham burton
> Tokenizer chain
> 
> What do you suggest for the number of training ham/spam mails?
> Are 2K mails enough? I trained dspam with the TEFT option. After the
> training I switched to TOE in dspam.conf.
> I would like to reduce the database size (currently 600MB) without losing
> spam catch rate.
> 
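
One note on the fillfactor change you mention above: setting fillfactor on a
table that is already populated only affects pages written from that point on;
the existing pages keep their old packing until the table is rewritten. A
minimal sketch of how I would apply it (the table name is from the stock DSPAM
PostgreSQL schema, but the index name is an assumption on my side; check yours
with \d dspam_token_data in psql first):

  -- record the new fillfactor in the table's storage options
  ALTER TABLE dspam_token_data SET (fillfactor = 90);
  -- CLUSTER rewrites the whole table, so the new fillfactor is applied to
  -- every page, and the rows end up physically ordered by the (uid, token)
  -- index, which also improves locality for the per-message token lookups
  CLUSTER dspam_token_data USING id_token_data_01;
  -- refresh planner statistics after the rewrite
  ANALYZE dspam_token_data;
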
I don't know how open you are to suggestions, but if you trust me, I would like
to get hold of the data you used for the training. If you can compress the
spam/ham and make it available for download, I would offer to do the training
for you. I would do the training with an application I developed myself, which
trains differently than the stock DSPAM training application, but the end
result can be consumed by stock DSPAM. After the whole training I would simply
export the data from PostgreSQL, compress it, and make it available to you.
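
For the export step, something along these lines would be all that is needed
(a sketch; the table names assume the stock DSPAM PostgreSQL schema and "dspam"
is just a placeholder for the actual database name):

  # dump the token and statistics tables and compress the result
  pg_dump -U dspam -t dspam_token_data -t dspam_stats dspam | gzip > dspam-data.sql.gz
  # on the receiving side, restore into a fresh database
  gunzip -c dspam-data.sql.gz | psql -U dspam dspam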

I am confident that the different training method will result in much less data
than the stock DSPAM training method while having at least an equal catch rate
(in my experience the catch rate will be better).

Unfortunately I cannot release that training application, because I have made
some changes to stock DSPAM and the training application uses new functionality
not available in stock DSPAM.

Anyway... if you are open to it, let me know where I can download the training
data and I will do the training. I promise that I will NOT use the data for
anything other than the training. I don't think the spam part is sensitive, but
the ham part certainly is. You have my word that I will not reuse or
redistribute that data.


> 
> Here is the debug log. As you see, there is a 22 second delay between
> the "pgsql query..." line and the BNR pattern.
> It seems dspam spends that time in the database query.
> 
Crazy. The query is only around 11K. That's nothing. And you run that on a 4GB
system? That should be enough; DSPAM is not that memory hungry.
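
Before anything else I would check whether the box is swapping while DSPAM runs
(top or vmstat on FreeBSD will show it right away) and whether the token
lookups actually use the index. Two quick checks from psql (the table name is
from the stock DSPAM PostgreSQL schema; adjust it if yours differs):

  -- sequential vs. index scans on the token table since the last stats reset
  SELECT relname, seq_scan, idx_scan
    FROM pg_stat_user_tables
   WHERE relname = 'dspam_token_data';

  -- how much memory PostgreSQL may use for caching; on a 4GB box a
  -- shared_buffers of a few hundred MB is a common starting point
  SHOW shared_buffers;
  SHOW effective_cache_size;

If seq_scan keeps climbing with every processed message, the per-message query
is not hitting the (uid, token) index, and that alone would explain
multi-second lookups on a table with 7 million rows.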


> Tue Mar 29 15:15:25 2011  1112: [03/29/2011 15:15:25] Processing body
> token 'visit'
> Tue Mar 29 15:15:25 2011  1112: [03/29/2011 15:15:25] Finished
> tokenizing (ngram) message
> Tue Mar 29 15:15:25 2011  1112: [03/29/2011 15:15:25] pgsql query length: 
> 11051
> Tue Mar 29 15:15:25 2011
> Tue Mar 29 15:15:47 2011  1112: [03/29/2011 15:15:47] BNR pattern
> instantiated: 'bnr.s|0.00_0.00_0.05_'
> Tue Mar 29 15:15:47 2011  1112: [03/29/2011 15:15:47] BNR pattern
> instantiated: 'bnr.s|0.00_0.05_0.30_'
> Tue Mar 29 15:15:47 2011  1112: [03/29/2011 15:15:47] BNR pattern
> instantiated: 'bnr.s|0.05_0.30_0.10_'
> 
> 
> Tue Mar 29 15:23:32 2011  1112: [03/29/2011 15:23:32] Finished
> tokenizing (ngram) message
> Tue Mar 29 15:23:32 2011  1112: [03/29/2011 15:23:32] pgsql query length: 
> 11023
> Tue Mar 29 15:23:32 2011
> Tue Mar 29 15:23:41 2011  1112: [03/29/2011 15:23:41] BNR pattern
> instantiated: 'bnr.s|0.00_0.00_0.05_'
> Tue Mar 29 15:23:41 2011  1112: [03/29/2011 15:23:41] BNR pattern
> instantiated: 'bnr.s|0.00_0.05_0.30_'
> 
> 
> 
> Tue Mar 29 15:35:08 2011  1112: [03/29/2011 15:35:08] Processing body
> token 'org"'
> Tue Mar 29 15:35:08 2011  1112: [03/29/2011 15:35:08] Finished
> tokenizing (ngram) message
> Tue Mar 29 15:35:08 2011  1112: [03/29/2011 15:35:08] pgsql query length: 
> 28271
> Tue Mar 29 15:35:08 2011
> Tue Mar 29 15:35:48 2011  1112: [03/29/2011 15:35:48] BNR pattern
> instantiated: 'bnr.s|0.00_0.00_0.50_'
> Tue Mar 29 15:35:48 2011  1112: [03/29/2011 15:35:48] BNR pattern
> instantiated: 'bnr.s|0.00_0.50_0.10_'
> Tue Mar 29 15:35:48 2011  1112: [03/29/2011 15:35:48] BNR pattern
> instantiated: 'bnr.s|0.50_0.10_0.15_'
> 
Really strange. 40 seconds between the query and BNR? That is way too much time.
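
To pin down where those seconds go, I would enable slow-query logging on the
PostgreSQL side and then process a single message: if the SELECT itself shows
up in the log with a 20-40 second runtime, the time is spent inside the
database; if not, DSPAM is losing it elsewhere. A sketch (one postgresql.conf
line plus a reload from psql):

  # postgresql.conf: log every statement that runs longer than one second
  log_min_duration_statement = 1000

  -- afterwards, from psql, reload the configuration without a restart
  SELECT pg_reload_conf();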

If you trust me regarding the ham data, I would be very interested to see how
low I can go with the space usage while still maintaining high accuracy. After
all, you don't have anything to lose. You could save your current data and then
switch inside dspam.conf from one database instance to the other and see which
one has better accuracy, or keep your current dspam.conf, swap in the one I
would provide for the dataset I produced, and then compare the results.
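
Switching between the two databases would then just be a matter of changing the
PgSQL driver settings in dspam.conf, for example (directive names as shipped in
the stock dspam.conf for the PostgreSQL driver; "dspam_experiment" is only a
placeholder for whatever we would call the second database):

  # dspam.conf: point DSPAM at one database instance or the other
  PgSQLServer   127.0.0.1
  PgSQLPort     5432
  PgSQLUser     dspam
  PgSQLPass     changeme
  # your current data:
  PgSQLDb       dspam
  # the dataset I would provide (commented out until you want to compare):
  #PgSQLDb      dspam_experiment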

Are you open to such a small experiment? Just let me know.



> Thanks.
> 
-- 
Kind Regards from Switzerland,

Stevan Bajić


> 
> On Tue, Mar 29, 2011 at 4:28 PM, Kenneth Marshall <k...@rice.edu> wrote:
> > On Tue, Mar 29, 2011 at 11:45:39AM +0300, Ibrahim Harrani wrote:
> >> Hi,
> >>
> >> I am testing the git version of dspam with PostgreSQL 9.0 running on
> >> FreeBSD 8 (dual core CPU, 4 GB memory).
> >>
> >> I trained dspam with 110K spam and 50K ham mails. Now I have more than
> >> 7 million entries in dspam.
> >>
> >> dspam=# SELECT count(*) from dspam_token_data ;
> >>   count
> >> ---------
> >>  7075311
> >> (1 row)
> >>
> >> I vacuum and reindex database regularly.
> >>
> >> When I start dspam, processing an email takes 40-50 sec at the
> >> beginning, then drops to 10 sec.
> >> If I run this test on a more powerful server (quad core CPU with 16GB
> >> memory), it takes 0.01 secs.
> >> I believe the problem on the small server is the large number of database
> >> entries, but I would like to get better performance
> >> on the small server as well. Any idea?
> >>
> >> Do you think that sqlite might be better than pgsql on this setup? Or
> >> did I train dspam with too many spam/ham mails?
> >>
> >> Thanks.
> >>
> >
> > Hi Ibrahim,
> >
> > Are these 7 million tokens for a single user? What tokenizer are you
> > using: WORD, CHAIN, MARKOV/OSB, MARKOV/SBPH? That seems like an awful
> > lot of training. The docs usually recommend 2k messages each of ham
> > and spam. When we generated a base corpus for our user community,
> > we pruned the resulting millions of tokens down to about 300k. Another
> > thing that can help is to cluster your data on the uid+token index.
> > It looks like you cannot keep the full set of active token pages in memory
> > with only a 4GB system. Look at your paging/swapping stats. You may
> > be able to reduce your memory footprint, which should help your performance.
> > Do you have your FILL FACTOR set to allow HOT updates?
> >
> > Cheers,
> > Ken
> >
> 
