Hi Stevan,

It is very nice to see you on the list after such a long time.
Sure, I trust you, and I can provide you with the spam/ham mails. But how many
mails do you need? :)
After running the following query, my database size shrank to 70MB.

DELETE FROM dspam_token_data WHERE innocent_hits < 10 AND spam_hits < 10;
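
To make clear what the query above keeps and what it drops, here is a small sketch in Python. It is only an illustration: it uses an in-memory SQLite table as a stand-in for the PostgreSQL one, the sample rows are invented, and the real dspam_token_data table has more columns than the three shown here.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL database; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dspam_token_data "
    "(token INTEGER, spam_hits INTEGER, innocent_hits INTEGER)"
)
conn.executemany(
    "INSERT INTO dspam_token_data VALUES (?, ?, ?)",
    [
        (1, 0, 1),     # rare token: pruned
        (2, 9, 9),     # below 10 on both counters: pruned
        (3, 250, 3),   # strong spam token: kept
        (4, 2, 180),   # strong ham token: kept
        (5, 10, 0),    # exactly 10 spam hits: kept, since 10 < 10 is false
    ],
)

PRUNE_CONDITION = "innocent_hits < 10 AND spam_hits < 10"

# Count first, so the size of the cut is known before anything is deleted.
(to_delete,) = conn.execute(
    f"SELECT count(*) FROM dspam_token_data WHERE {PRUNE_CONDITION}"
).fetchone()

deleted = conn.execute(
    f"DELETE FROM dspam_token_data WHERE {PRUNE_CONDITION}"
).rowcount
conn.commit()

kept = [
    row[0]
    for row in conn.execute("SELECT token FROM dspam_token_data ORDER BY token")
]
print(f"would delete {to_delete}, deleted {deleted}, kept tokens {kept}")
```

A token survives as soon as either counter reaches 10, so strongly one-sided tokens stay in the table. As far as I know, on PostgreSQL the freed space goes back to the operating system only after a VACUUM FULL or a dump and restore.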

Now dspam processes a mail in less than one second.
I also added many IgnoreHeader entries to dspam.conf from
http://sourceforge.net/apps/mediawiki/dspam/index.php?title=Working_DSPAM%2BPOSTFIX%2BMYSQL%2BCLAMAV_Setup_by_PaulC
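
To compare processing times before and after such changes, the delay can be read from the debug log. Below is a rough sketch that measures the gap between each "pgsql query length" line and the first "BNR pattern" line that follows it. The sample lines are taken from the log quoted further down, with the mail line wrapping undone; the function name query_delays is just something I made up for the example.

```python
import re
from datetime import datetime

# Sample lines from the quoted debug log, with the mail wrapping undone.
LOG = """\
Tue Mar 29 15:15:25 2011  1112: [03/29/2011 15:15:25] pgsql query length: 11051
Tue Mar 29 15:15:47 2011  1112: [03/29/2011 15:15:47] BNR pattern instantiated: 'bnr.s|0.00_0.00_0.05_'
Tue Mar 29 15:15:47 2011  1112: [03/29/2011 15:15:47] BNR pattern instantiated: 'bnr.s|0.00_0.05_0.30_'
Tue Mar 29 15:35:08 2011  1112: [03/29/2011 15:35:08] pgsql query length: 28271
Tue Mar 29 15:35:48 2011  1112: [03/29/2011 15:35:48] BNR pattern instantiated: 'bnr.s|0.00_0.00_0.50_'
"""

# Captures the bracketed timestamp and the message text that follows it.
LINE = re.compile(r"\[(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\] (.*)")


def query_delays(log_text):
    """Return seconds between each query line and the next BNR line."""
    delays = []
    query_time = None
    for line in log_text.splitlines():
        match = LINE.search(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%m/%d/%Y %H:%M:%S")
        message = match.group(2)
        if message.startswith("pgsql query length"):
            query_time = stamp
        elif message.startswith("BNR pattern") and query_time is not None:
            delays.append(int((stamp - query_time).total_seconds()))
            query_time = None  # only the first BNR line after a query counts
    return delays


delays = query_delays(LOG)
print(delays)
```

The log only has one-second resolution, so this is good for spotting delays of many seconds, not for timing sub-second runs.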

PS: I think this training issue is a big problem for newcomers. We need
a good document about training.
Once I have learned it well, I plan to write one.
Thanks.


On Tue, Mar 29, 2011 at 10:35 PM, Stevan Bajić <ste...@bajic.ch> wrote:
> On Tue, 29 Mar 2011 17:24:28 +0300
> Ibrahim Harrani <ibrahim.harr...@gmail.com> wrote:
>
>> Hi Kenneth,
>>
> Hello Ibrahim,
>
>
>> Thanks for your prompt reply.
>> Yes, this is from a single user. But I am planning to use this user as a
>> global user that will be managed by admins.
>> I trained all spam with the same --username.
>> I changed the fillfactor to 90 after the training, not at the beginning,
>> but this did not solve the problem.
>>
>> Algorithm graham burton
>> Tokenizer chain
>>
>> What do you suggest for the number of training ham/spam mails?
>> Are 2K mails enough? I trained dspam with the TEFT option. After the
>> training I switched to TOE in dspam.conf.
>> I would like to reduce the database size (currently 600MB) without losing
>> spam catch rate.
>>
> I don't know how open you are for suggestions? If you trust me then I would 
> like to get hold of the data you used for the training. If you can compress 
> the Spam/Ham and make it available for download, then I would like to offer 
> you to do the training for you. I would do the training with an application 
> I developed myself, which trains differently than the stock DSPAM training 
> application. The end result can be consumed with stock DSPAM. So after the 
> whole training I would just export the data from PostgreSQL and compress it 
> and make it available to you.
>
> I am confident that the different training method will result in much less 
> data than the stock DSPAM training method while having at least an equal 
> catch rate (in my experience the catch rate will be better).
>
> Unfortunately I cannot release that training application because I have made 
> some changes to stock DSPAM and that training application uses new 
> functionality not available in stock DSPAM.
>
> Anyway... if you are open minded then let me know where I can download the 
> training data and I will do the training. I promise that I will NOT use the 
> data for anything other than the training. I don't think that the Spam part 
> is sensitive but the Ham part sure is. But you have my word that I will not 
> reuse or redistribute that data.
>
>
>>
>> Here is the debug log. As you see, there is a 22 second delay between
>> the "pgsql query..." line and the first BNR pattern line.
>> It seems dspam spends this time in the database query.
>>
> Crazy. The query is just around 11K. That's nothing. And you run that on a 
> 4GB system? This should be enough. DSPAM is not that memory hungry.
>
>
>> Tue Mar 29 15:15:25 2011  1112: [03/29/2011 15:15:25] Processing body
>> token 'visit'
>> Tue Mar 29 15:15:25 2011  1112: [03/29/2011 15:15:25] Finished
>> tokenizing (ngram) message
>> Tue Mar 29 15:15:25 2011  1112: [03/29/2011 15:15:25] pgsql query length: 
>> 11051
>> Tue Mar 29 15:15:25 2011
>> Tue Mar 29 15:15:47 2011  1112: [03/29/2011 15:15:47] BNR pattern
>> instantiated: 'bnr.s|0.00_0.00_0.05_'
>> Tue Mar 29 15:15:47 2011  1112: [03/29/2011 15:15:47] BNR pattern
>> instantiated: 'bnr.s|0.00_0.05_0.30_'
>> Tue Mar 29 15:15:47 2011  1112: [03/29/2011 15:15:47] BNR pattern
>> instantiated: 'bnr.s|0.05_0.30_0.10_'
>>
>>
>> Tue Mar 29 15:23:32 2011  1112: [03/29/2011 15:23:32] Finished
>> tokenizing (ngram) message
>> Tue Mar 29 15:23:32 2011  1112: [03/29/2011 15:23:32] pgsql query length: 
>> 11023
>> Tue Mar 29 15:23:32 2011
>> Tue Mar 29 15:23:41 2011  1112: [03/29/2011 15:23:41] BNR pattern
>> instantiated: 'bnr.s|0.00_0.00_0.05_'
>> Tue Mar 29 15:23:41 2011  1112: [03/29/2011 15:23:41] BNR pattern
>> instantiated: 'bnr.s|0.00_0.05_0.30_'
>>
>>
>>
>> Tue Mar 29 15:35:08 2011  1112: [03/29/2011 15:35:08] Processing body
>> token 'org"'
>> Tue Mar 29 15:35:08 2011  1112: [03/29/2011 15:35:08] Finished
>> tokenizing (ngram) message
>> Tue Mar 29 15:35:08 2011  1112: [03/29/2011 15:35:08] pgsql query length: 
>> 28271
>> Tue Mar 29 15:35:08 2011
>> Tue Mar 29 15:35:48 2011  1112: [03/29/2011 15:35:48] BNR pattern
>> instantiated: 'bnr.s|0.00_0.00_0.50_'
>> Tue Mar 29 15:35:48 2011  1112: [03/29/2011 15:35:48] BNR pattern
>> instantiated: 'bnr.s|0.00_0.50_0.10_'
>> Tue Mar 29 15:35:48 2011  1112: [03/29/2011 15:35:48] BNR pattern
>> instantiated: 'bnr.s|0.50_0.10_0.15_'
>>
> Really strange. 40 seconds between query and BNR? This is way too much time.
>
> If you trust me regarding the Ham data then I would be very much interested 
> to see how low I can go with the space usage and still maintain a high 
> accuracy. After all, you don't have anything to lose. You could keep your 
> current data and switch dspam.conf between the two database instances to 
> see which one has better accuracy, or swap your current dspam.conf for the 
> one I would provide for use with the dataset I produced and then compare 
> the results.
>
> Are you open minded for such a small experiment? Just let me know.
>
>
>
>> Thanks.
>>
> --
> Kind Regards from Switzerland,
>
> Stevan Bajić
>
>
>>
>> On Tue, Mar 29, 2011 at 4:28 PM, Kenneth Marshall <k...@rice.edu> wrote:
>> > On Tue, Mar 29, 2011 at 11:45:39AM +0300, Ibrahim Harrani wrote:
>> >> Hi,
>> >>
>> >> I am testing the git version of dspam with PostgreSQL 9.0 running on
>> >> FreeBSD 8 (dual core CPU, 4 GB memory).
>> >>
>> >> I trained dspam with 110K spam and 50K ham mails. Now I have more than
>> >> 7 million entries in dspam.
>> >>
>> >> dspam=# SELECT count(*) from dspam_token_data ;
>> >>   count
>> >> ---------
>> >>  7075311
>> >> (1 row)
>> >>
>> >> I vacuum and reindex the database regularly.
>> >>
>> >> When I start dspam, processing an email takes 40-50 sec at the
>> >> beginning, then drops to 10 sec.
>> >> If I run this test on a more powerful server (quad core CPU with 16GB
>> >> memory), it takes 0.01 secs.
>> >> I believe the problem on the small server is the large number of database
>> >> entries, but I would like to get better performance
>> >> on the small server as well. Any ideas?
>> >>
>> >> Do you think that sqlite might be better than pgsql on this setup? Or
>> >> did I train dspam with too much spam/ham?
>> >>
>> >> Thanks.
>> >>
>> >
>> > Hi Ibrahim,
>> >
>> > Are these 7 million tokens for a single user? What tokenizer are you
>> > using: WORD, CHAIN, MARKOV/OSB, MARKOV/SBPH? That seems like an awful
>> > lot of training. The docs usually recommend 2k messages each of ham
>> > and spam. When we generated a base corpus for our user community,
>> > we pruned the resulting millions of tokens down to about 300k. Another
>> > thing that can help is to cluster your data on the uid+token index.
>> > It looks like you cannot keep the full active token pages in memory
>> > with only a 4GB system. Look at your paging/swapping stats. You may
>> > be able to reduce your memory footprint which should help your performance.
>> > Do you have your FILL FACTOR set to allow HOT updates?
>> >
>> > Cheers,
>> > Ken
>> >
>>
>> _______________________________________________
>> Dspam-user mailing list
>> Dspam-user@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/dspam-user
>>
>
> ------------------------------------------------------------------------------
> Enable your software for Intel(R) Active Management Technology to meet the
> growing manageability and security demands of your customers. Businesses
> are taking advantage of Intel(R) vPro (TM) technology - will your software
> be a part of the solution? Download the Intel(R) Manageability Checker
> today! http://p.sf.net/sfu/intel-dev2devmar
> _______________________________________________
> Dspam-user mailing list
> Dspam-user@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspam-user
>
