Hi Stevan,

What is the best method for training?
1. Train spam/ham mails separately with dspam --client --user x
--class=ham/spam
2. Train with dspam_train, using the same number of ham/spam mails
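
For option 2, here is a minimal Python sketch of preparing a balanced, alternating training order before handing the mails to any trainer. The function name balanced_training_order and the sample file names are hypothetical, and the code does not call dspam or dspam_train; it only illustrates "the same number of ham/spam".

```python
import random
from itertools import chain


def balanced_training_order(ham, spam, seed=0):
    """Return (label, message) pairs with equal ham/spam counts, alternating.

    The larger corpus is randomly down-sampled to the size of the smaller one.
    """
    n = min(len(ham), len(spam))
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    ham_sample = rng.sample(list(ham), n)
    spam_sample = rng.sample(list(spam), n)
    # zip pairs one ham with one spam; chain flattens to ham, spam, ham, spam, ...
    pairs = zip((("ham", m) for m in ham_sample),
                (("spam", m) for m in spam_sample))
    return list(chain.from_iterable(pairs))


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    ham_files = ["ham/1.eml", "ham/2.eml", "ham/3.eml"]
    spam_files = ["spam/1.eml", "spam/2.eml", "spam/3.eml", "spam/4.eml"]
    for label, path in balanced_training_order(ham_files, spam_files):
        print(label, path)
```

Each pair could then be passed to whichever training command is chosen, one mail at a time.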

Thanks.
On Tue, Mar 29, 2011 at 11:01 PM, Ibrahim Harrani
<ibrahim.harr...@gmail.com> wrote:
> Hi Stevan,
>
> It is very nice to see you on the list after such a long time.
> Sure, I trust you and I can provide you with spam/ham mails. But how many
> mails do you need? :)
> After running the following query, my database size dropped to 70MB.
>
> DELETE FROM dspam_token_data   WHERE innocent_hits < 10 AND spam_hits < 10
>
> Now dspam processes a mail in less than one second.
> I also added many IgnoreHeader entries to dspam.conf from
> http://sourceforge.net/apps/mediawiki/dspam/index.php?title=Working_DSPAM%2BPOSTFIX%2BMYSQL%2BCLAMAV_Setup_by_PaulC
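
A minimal sketch of what the DELETE above keeps and drops, using Python's built-in sqlite3 as a stand-in for PostgreSQL. The table here is a simplified three-column stand-in, not the real schema, and the helper name prune_preview and the sample rows are made up for illustration. Counting the matching rows first shows how much a prune would remove before it runs.

```python
import sqlite3


def prune_preview(rows, min_hits=10):
    """Count rows the prune would delete, run it, and return (deleted, kept).

    rows: iterable of (token, innocent_hits, spam_hits) tuples.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dspam_token_data "
        "(token INTEGER, innocent_hits INTEGER, spam_hits INTEGER)"
    )
    conn.executemany("INSERT INTO dspam_token_data VALUES (?, ?, ?)", rows)
    # Same condition as the query in the mail: both counters below the limit.
    where = "innocent_hits < ? AND spam_hits < ?"
    doomed = conn.execute(
        f"SELECT count(*) FROM dspam_token_data WHERE {where}",
        (min_hits, min_hits),
    ).fetchone()[0]
    conn.execute(f"DELETE FROM dspam_token_data WHERE {where}", (min_hits, min_hits))
    kept = conn.execute("SELECT count(*) FROM dspam_token_data").fetchone()[0]
    conn.close()
    return doomed, kept


if __name__ == "__main__":
    sample = [(1, 0, 3), (2, 12, 0), (3, 2, 40), (4, 9, 9), (5, 15, 20)]
    print(prune_preview(sample))  # (2, 3)
```

Note that a token seen often in only one class (for example 2 innocent hits and 40 spam hits) survives, because the condition requires both counters to be low.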
>
> PS: I think this training issue is a big problem for newcomers. We need
> a good document about training.
> Once I understand it well, I plan to write one.
> Thanks.
>
>
> On Tue, Mar 29, 2011 at 10:35 PM, Stevan Bajić <ste...@bajic.ch> wrote:
>> On Tue, 29 Mar 2011 17:24:28 +0300
>> Ibrahim Harrani <ibrahim.harr...@gmail.com> wrote:
>>
>>> Hi Kenneth,
>>>
>> Hello Ibrahim,
>>
>>
>>> Thanks for your prompt reply.
>>> Yes, this is from a single user. But I am planning to use this user as a
>>> global user that will be managed by admins.
>>> I trained all spam with the same --username.
>>> I changed fillfactor to 90 after the training, not at the beginning,
>>> but this did not solve the problem.
>>>
>>> Algorithm graham burton
>>> Tokenizer chain
>>>
>>> What do you suggest for the number of training ham/spam mails?
>>> Are 2K mails enough? I trained dspam with the TEFT option. After the
>>> training I switched to TOE in dspam.conf.
>>> I would like to reduce the database size (currently 600MB) without losing
>>> spam catch rate.
>>>
>> I don't know how open you are to suggestions. If you trust me, then I would
>> like to get hold of the data you used for the training. If you can compress
>> the spam/ham and make it available for download, then I would like to offer
>> to do the training for you. I would do the training with an application I
>> developed myself, which trains differently than the stock DSPAM training
>> application. The end result can be consumed by stock DSPAM.
>> So after the whole training I would just export the data from PostgreSQL,
>> compress it, and make it available to you.
>>
>> I am confident that the different training method will result in much less
>> data than the stock DSPAM training method while having at least an equal
>> catch rate (in my experience the catch rate will be better).
>>
>> Unfortunately I cannot release that training application because I have
>> made some changes to stock DSPAM and the training application uses new
>> functionality not available in stock DSPAM.
>>
>> Anyway... if you are open-minded, then let me know where I can download the
>> training data and I will do the training. I promise that I will NOT use the
>> data for anything other than the training. I don't think that the spam part
>> is sensitive, but the ham part surely is. But you have my word that I will
>> not reuse or redistribute that data.
>>
>>
>>>
>>> Here is the debug log. As you can see, there is a 22 second delay between
>>> the "pgsql query..." line and the first BNR pattern line.
>>> It seems dspam spends that time in the database query.
>>>
>> Crazy. The query is just around 11K. That's nothing. And you run that on a 
>> 4GB system? This should be enough. DSPAM is not that memory hungry.
>>
>>
>>> Tue Mar 29 15:15:25 2011  1112: [03/29/2011 15:15:25] Processing body
>>> token 'visit'
>>> Tue Mar 29 15:15:25 2011  1112: [03/29/2011 15:15:25] Finished
>>> tokenizing (ngram) message
>>> Tue Mar 29 15:15:25 2011  1112: [03/29/2011 15:15:25] pgsql query length: 
>>> 11051
>>> Tue Mar 29 15:15:25 2011
>>> Tue Mar 29 15:15:47 2011  1112: [03/29/2011 15:15:47] BNR pattern
>>> instantiated: 'bnr.s|0.00_0.00_0.05_'
>>> Tue Mar 29 15:15:47 2011  1112: [03/29/2011 15:15:47] BNR pattern
>>> instantiated: 'bnr.s|0.00_0.05_0.30_'
>>> Tue Mar 29 15:15:47 2011  1112: [03/29/2011 15:15:47] BNR pattern
>>> instantiated: 'bnr.s|0.05_0.30_0.10_'
>>>
>>>
>>> Tue Mar 29 15:23:32 2011  1112: [03/29/2011 15:23:32] Finished
>>> tokenizing (ngram) message
>>> Tue Mar 29 15:23:32 2011  1112: [03/29/2011 15:23:32] pgsql query length: 
>>> 11023
>>> Tue Mar 29 15:23:32 2011
>>> Tue Mar 29 15:23:41 2011  1112: [03/29/2011 15:23:41] BNR pattern
>>> instantiated: 'bnr.s|0.00_0.00_0.05_'
>>> Tue Mar 29 15:23:41 2011  1112: [03/29/2011 15:23:41] BNR pattern
>>> instantiated: 'bnr.s|0.00_0.05_0.30_'
>>>
>>>
>>>
>>> Tue Mar 29 15:35:08 2011  1112: [03/29/2011 15:35:08] Processing body
>>> token 'org"'
>>> Tue Mar 29 15:35:08 2011  1112: [03/29/2011 15:35:08] Finished
>>> tokenizing (ngram) message
>>> Tue Mar 29 15:35:08 2011  1112: [03/29/2011 15:35:08] pgsql query length: 
>>> 28271
>>> Tue Mar 29 15:35:08 2011
>>> Tue Mar 29 15:35:48 2011  1112: [03/29/2011 15:35:48] BNR pattern
>>> instantiated: 'bnr.s|0.00_0.00_0.50_'
>>> Tue Mar 29 15:35:48 2011  1112: [03/29/2011 15:35:48] BNR pattern
>>> instantiated: 'bnr.s|0.00_0.50_0.10_'
>>> Tue Mar 29 15:35:48 2011  1112: [03/29/2011 15:35:48] BNR pattern
>>> instantiated: 'bnr.s|0.50_0.10_0.15_'
>>>
>> Really strange. 40 seconds between query and BNR? This is way too much time.
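
The 22 and 40 second gaps can be read off the debug log mechanically. Below is a minimal Python sketch that measures the time between each "pgsql query length" line and the first "BNR pattern" line that follows it. The function name query_to_bnr_delays is hypothetical, and the sample lines are the log lines from this thread with the mail client's line wrapping undone.

```python
import re
from datetime import datetime

# Matches the bracketed timestamp, e.g. [03/29/2011 15:15:25]
STAMP = re.compile(r"\[(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\]")


def query_to_bnr_delays(lines):
    """Return the seconds from each query line to the next BNR pattern line."""
    delays = []
    query_time = None
    for line in lines:
        match = STAMP.search(line)
        if not match:
            continue  # skip lines without a bracketed timestamp
        stamp = datetime.strptime(match.group(1), "%m/%d/%Y %H:%M:%S")
        if "pgsql query length" in line:
            query_time = stamp
        elif "BNR pattern" in line and query_time is not None:
            delays.append((stamp - query_time).total_seconds())
            query_time = None  # only the first BNR line after a query counts
    return delays


if __name__ == "__main__":
    sample = [
        "Tue Mar 29 15:15:25 2011  1112: [03/29/2011 15:15:25] pgsql query length: 11051",
        "Tue Mar 29 15:15:47 2011  1112: [03/29/2011 15:15:47] BNR pattern instantiated: 'bnr.s|0.00_0.00_0.05_'",
        "Tue Mar 29 15:15:47 2011  1112: [03/29/2011 15:15:47] BNR pattern instantiated: 'bnr.s|0.00_0.05_0.30_'",
        "Tue Mar 29 15:35:08 2011  1112: [03/29/2011 15:35:08] pgsql query length: 28271",
        "Tue Mar 29 15:35:48 2011  1112: [03/29/2011 15:35:48] BNR pattern instantiated: 'bnr.s|0.00_0.00_0.50_'",
    ]
    print(query_to_bnr_delays(sample))  # [22.0, 40.0]
```

Running this over a longer log would show whether the delay shrinks as the database cache warms up.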
>>
>> If you trust me regarding the ham data, then I would be very much interested
>> to see how low I can go with the space usage and still maintain high
>> accuracy. After all, you don't have anything to lose. You could save
>> your current data, then switch inside dspam.conf from one database
>> instance to the other and see which one has better accuracy, or swap your
>> current dspam.conf for the one I would provide for use with
>> the dataset I produce, and then compare the results.
>>
>> Are you open to such a small experiment? Just let me know.
>>
>>
>>
>>> Thanks.
>>>
>> --
>> Kind Regards from Switzerland,
>>
>> Stevan Bajić
>>
>>
>>>
>>> On Tue, Mar 29, 2011 at 4:28 PM, Kenneth Marshall <k...@rice.edu> wrote:
>>> > On Tue, Mar 29, 2011 at 11:45:39AM +0300, Ibrahim Harrani wrote:
>>> >> Hi,
>>> >>
>>> >> I am testing the git version of dspam with PostgreSQL 9.0 running on
>>> >> FreeBSD 8 (dual-core CPU, 4 GB memory).
>>> >>
>>> >> I trained dspam with 110K spam and 50K ham mails. Now I have more than
>>> >> 7 million entries in dspam.
>>> >>
>>> >> dspam=# SELECT count(*) from dspam_token_data ;
>>> >>   count
>>> >> ---------
>>> >>  7075311
>>> >> (1 row)
>>> >>
>>> >> I vacuum and reindex the database regularly.
>>> >>
>>> >> When I start dspam, processing an email takes 40-50 sec at the
>>> >> beginning, then drops to 10 sec.
>>> >> If I run this test on a more powerful server (quad-core CPU with 16GB
>>> >> memory), it takes 0.01 secs.
>>> >> I believe the problem on the small server is the large number of database
>>> >> entries, but I would like to get better performance
>>> >> on the small server as well. Any ideas?
>>> >>
>>> >> Do you think that sqlite might be better than pgsql on this setup? Or
>>> >> did I train dspam with too much spam/ham?
>>> >>
>>> >> Thanks.
>>> >>
>>> >
>>> > Hi Ibrahim,
>>> >
>>> > Are these 7 million tokens for a single user? What tokenizer are you
>>> > using: WORD, CHAIN, MARKOV/OSB, MARKOV/SBPH? That seems like an awful
>>> > lot of training. The docs usually recommend 2k messages each of ham
>>> > and spam. When we generated a base corpus for our user community,
>>> > we pruned the resulting millions of tokens down to about 300k. Another
>>> > thing that can help is to cluster your data on the uid+token index.
>>> > It looks like you cannot keep the full active token pages in memory
>>> > with only a 4GB system. Look at your paging/swapping stats. You may
>>> > be able to reduce your memory footprint which should help your 
>>> > performance.
>>> > Do you have your FILL FACTOR set to allow HOT updates?
>>> >
>>> > Cheers,
>>> > Ken
>>> >
>>>
>>> _______________________________________________
>>> Dspam-user mailing list
>>> Dspam-user@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/dspam-user
>>>
>>
>>
>
