On Tue, 29 Mar 2011 23:01:41 +0300
Ibrahim Harrani <ibrahim.harr...@gmail.com> wrote:

> Hi Stevan,
> 
Hello Ibrahim,


> It is very nice to see you on the list after such a long time.
> Sure, I trust you and I can provide you spam/ham mails. But how many
> mails do you need? :)
>
Whatever you can give me, or whatever you want to train with. It's up to you.
You can give me everything you have. If I can choose, then give me as much Ham
as you can.


> After running the following query my database size became 70MB.
> 
> DELETE FROM dspam_token_data   WHERE innocent_hits < 10 AND spam_hits < 10
> 
Well... you just deleted data that could have been useful. I don't know your
DSPAM configuration, so I cannot fully judge whether this was clever or not.
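
One way to judge such a prune before running it is to count what the
condition would match. The following is only a sketch: it uses an in-memory
SQLite table as a hypothetical, simplified stand-in for dspam_token_data
(the real table lives in PostgreSQL), and the sample rows are invented for
illustration. It also counts how many of the affected tokens were seen in
only one class, on the assumption that such tokens may still carry signal.

```python
import sqlite3

# Hypothetical, simplified stand-in for dspam_token_data; the real table
# lives in PostgreSQL and has more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dspam_token_data ("
    "uid INTEGER, token INTEGER, spam_hits INTEGER, innocent_hits INTEGER)"
)
conn.executemany(
    "INSERT INTO dspam_token_data VALUES (?, ?, ?, ?)",
    [
        (1, 101, 250, 0),  # kept: frequent spam token
        (1, 102, 0, 180),  # kept: frequent ham token
        (1, 103, 9, 0),    # pruned, although it only ever appeared in spam
        (1, 104, 1, 1),    # pruned: rare and seen in both classes
        (1, 105, 3, 12),   # kept: innocent_hits is not below 10
    ],
)

# The condition from the DELETE statement quoted above.
PRUNE_WHERE = "innocent_hits < 10 AND spam_hits < 10"


def preview_prune(conn, where):
    """Count total rows, rows the prune would delete, and how many of those
    were seen in only one class. Nothing is deleted."""
    def count(sql):
        return conn.execute(sql).fetchone()[0]

    base = "SELECT count(*) FROM dspam_token_data"
    total = count(base)
    doomed = count(f"{base} WHERE {where}")
    one_sided = count(
        f"{base} WHERE ({where}) AND (spam_hits = 0 OR innocent_hits = 0)"
    )
    return total, doomed, one_sided


total, doomed, one_sided = preview_prune(conn, PRUNE_WHERE)
print(f"{doomed} of {total} tokens would be deleted, {one_sided} of them one-sided")
```

Against a real database the same two SELECT count(*) queries could be run
in psql before the DELETE; the Python wrapper is only there to make the
sketch self-contained.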


> Now dspam processes a mail in less than one second.
> I also added many IgnoreHeader entries to dspam.conf from
> http://sourceforge.net/apps/mediawiki/dspam/index.php?title=Working_DSPAM%2BPOSTFIX%2BMYSQL%2BCLAMAV_Setup_by_PaulC
> 
LOL. Those are headers I once added to Paul's configuration. Currently I use
these IgnoreHeader entries:
IgnoreHeader Accept-Language
IgnoreHeader Approved
IgnoreHeader Archive
IgnoreHeader Authentication-Results
IgnoreHeader Cache-Post-Path
IgnoreHeader Cancel-Key
IgnoreHeader Cancel-Lock
IgnoreHeader Complaints-To
IgnoreHeader Content-Description
IgnoreHeader Content-Disposition
IgnoreHeader Content-ID
IgnoreHeader Content-Language
IgnoreHeader Content-Return
IgnoreHeader Content-Transfer-Encoding
IgnoreHeader Content-Type
IgnoreHeader DKIM-Signature
IgnoreHeader Date
IgnoreHeader Disposition-Notification-To
IgnoreHeader DomainKey-Signature
IgnoreHeader Importance
IgnoreHeader In-Reply-To
IgnoreHeader Injection-Info
IgnoreHeader Lines
IgnoreHeader List-Archive
IgnoreHeader List-Help
IgnoreHeader List-Id
IgnoreHeader List-Post
IgnoreHeader List-Subscribe
IgnoreHeader List-Unsubscribe
IgnoreHeader Message-ID
IgnoreHeader Message-Id
IgnoreHeader NNTP-Posting-Date
IgnoreHeader NNTP-Posting-Host
IgnoreHeader Newsgroups
IgnoreHeader OpenPGP
IgnoreHeader Organization
IgnoreHeader Originator
IgnoreHeader PGP-ID
IgnoreHeader Path
IgnoreHeader Received
IgnoreHeader Received-SPF
IgnoreHeader References
IgnoreHeader Reply-To
IgnoreHeader Resent-Date
IgnoreHeader Resent-From
IgnoreHeader Resent-Message-ID
IgnoreHeader Thread-Index
IgnoreHeader Thread-Topic
IgnoreHeader User-Agent
IgnoreHeader X--MailScanner-SpamCheck
IgnoreHeader X-AV-Scanned
IgnoreHeader X-AVAS-Spam-Level
IgnoreHeader X-AVAS-Spam-Score
IgnoreHeader X-AVAS-Spam-Status
IgnoreHeader X-AVAS-Spam-Symbols
IgnoreHeader X-AVAS-Virus-Status
IgnoreHeader X-AVK-Virus-Check
IgnoreHeader X-Abuse
IgnoreHeader X-Abuse-Contact
IgnoreHeader X-Abuse-Info
IgnoreHeader X-Abuse-Management
IgnoreHeader X-Abuse-To
IgnoreHeader X-Abuse-and-DMCA-Info
IgnoreHeader X-Accept-Language
IgnoreHeader X-Admission-MailScanner-SpamCheck
IgnoreHeader X-Admission-MailScanner-SpamScore
IgnoreHeader X-Amavis-Alert
IgnoreHeader X-Amavis-Hold
IgnoreHeader X-Amavis-Modified
IgnoreHeader X-Amavis-OS-Fingerprint
IgnoreHeader X-Amavis-PenPals
IgnoreHeader X-Amavis-PolicyBank
IgnoreHeader X-AntiVirus
IgnoreHeader X-Antispam
IgnoreHeader X-Antivirus
IgnoreHeader X-Antivirus-Scanner
IgnoreHeader X-Antivirus-Status
IgnoreHeader X-Archive
IgnoreHeader X-Assp-Spam-Prob
IgnoreHeader X-Attention
IgnoreHeader X-BTI-AntiSpam
IgnoreHeader X-Barracuda
IgnoreHeader X-Barracuda-Bayes
IgnoreHeader X-Barracuda-Spam-Flag
IgnoreHeader X-Barracuda-Spam-Report
IgnoreHeader X-Barracuda-Spam-Score
IgnoreHeader X-Barracuda-Spam-Status
IgnoreHeader X-Barracuda-Virus-Scanned
IgnoreHeader X-BeenThere
IgnoreHeader X-Bogosity
IgnoreHeader X-Brightmail-Tracker
IgnoreHeader X-CRM114-CacheID
IgnoreHeader X-CRM114-Status
IgnoreHeader X-CRM114-Version
IgnoreHeader X-CTASD-IP
IgnoreHeader X-CTASD-RefID
IgnoreHeader X-CTASD-Sender
IgnoreHeader X-Cache
IgnoreHeader X-ClamAntiVirus-Scanner
IgnoreHeader X-Comment-To
IgnoreHeader X-Comments
IgnoreHeader X-Complaints
IgnoreHeader X-Complaints-Info
IgnoreHeader X-Complaints-To
IgnoreHeader X-DKIM
IgnoreHeader X-DMCA-Complaints-To
IgnoreHeader X-DMCA-Notifications
IgnoreHeader X-Despammed-Tracer
IgnoreHeader X-ELTE-SpamCheck
IgnoreHeader X-ELTE-SpamCheck-Details
IgnoreHeader X-ELTE-SpamScore
IgnoreHeader X-ELTE-SpamVersion
IgnoreHeader X-ELTE-VirusStatus
IgnoreHeader X-Enigmail-Supports
IgnoreHeader X-Enigmail-Version
IgnoreHeader X-Evolution-Source
IgnoreHeader X-Extra-Info
IgnoreHeader X-FSFE-MailScanner
IgnoreHeader X-FSFE-MailScanner-From
IgnoreHeader X-Face
IgnoreHeader X-Fellowship-MailScanner
IgnoreHeader X-Fellowship-MailScanner-From
IgnoreHeader X-Forwarded
IgnoreHeader X-GMX-Antispam
IgnoreHeader X-GMX-Antivirus
IgnoreHeader X-GPG-Fingerprint
IgnoreHeader X-GPG-Key-ID
IgnoreHeader X-GPS-DegDec
IgnoreHeader X-GPS-MGRS
IgnoreHeader X-GWSPAM
IgnoreHeader X-Gateway
IgnoreHeader X-Greylist
IgnoreHeader X-HTMLM
IgnoreHeader X-HTMLM-Info
IgnoreHeader X-HTMLM-Score
IgnoreHeader X-HTTP-Posting-Host
IgnoreHeader X-HTTP-UserAgent
IgnoreHeader X-HTTP-Via
IgnoreHeader X-Headers-End
IgnoreHeader X-ID
IgnoreHeader X-IMAIL-SPAM-STATISTICS
IgnoreHeader X-IMAIL-SPAM-URL-DBL
IgnoreHeader X-IMAIL-SPAM-VALFROM
IgnoreHeader X-IMAIL-SPAM-VALHELO
IgnoreHeader X-IMAIL-SPAM-VALREVDNS
IgnoreHeader X-Info
IgnoreHeader X-IronPort-Anti-Spam-Filtered
IgnoreHeader X-IronPort-Anti-Spam-Result
IgnoreHeader X-KSV-Antispam
IgnoreHeader X-Kaspersky-Antivirus
IgnoreHeader X-MDAV-Processed
IgnoreHeader X-MDRemoteIP
IgnoreHeader X-MDaemon-Deliver-To
IgnoreHeader X-MIE-MailScanner-SpamCheck
IgnoreHeader X-MIMEOLE
IgnoreHeader X-MIMETrack
IgnoreHeader X-MMS-Spam-Filter-ID
IgnoreHeader X-MS-Exchange-Forest-RulesExecuted
IgnoreHeader X-MS-Exchange-Organization-Antispam-Report
IgnoreHeader X-MS-Exchange-Organization-AuthAs
IgnoreHeader X-MS-Exchange-Organization-AuthDomain
IgnoreHeader X-MS-Exchange-Organization-AuthMechanism
IgnoreHeader X-MS-Exchange-Organization-AuthSource
IgnoreHeader X-MS-Exchange-Organization-Journal-Report
IgnoreHeader X-MS-Exchange-Organization-Original-Scl
IgnoreHeader X-MS-Exchange-Organization-Original-Sender
IgnoreHeader X-MS-Exchange-Organization-OriginalArrivalTime
IgnoreHeader X-MS-Exchange-Organization-OriginalSize
IgnoreHeader X-MS-Exchange-Organization-PCL
IgnoreHeader X-MS-Exchange-Organization-Quarantine
IgnoreHeader X-MS-Exchange-Organization-SCL
IgnoreHeader X-MS-Exchange-Organization-SenderIdResult
IgnoreHeader X-MS-Has-Attach
IgnoreHeader X-MS-TNEF-Correlator
IgnoreHeader X-MSMail-Priority
IgnoreHeader X-MailScanner
IgnoreHeader X-MailScanner-Information
IgnoreHeader X-MailScanner-SpamCheck
IgnoreHeader X-Mailer
IgnoreHeader X-Mailman-Version
IgnoreHeader X-Mlf-Spam-Status
IgnoreHeader X-NAI-Spam-Checker-Version
IgnoreHeader X-NAI-Spam-Flag
IgnoreHeader X-NAI-Spam-Level
IgnoreHeader X-NAI-Spam-Report
IgnoreHeader X-NAI-Spam-Route
IgnoreHeader X-NAI-Spam-Rules
IgnoreHeader X-NAI-Spam-Score
IgnoreHeader X-NAI-Spam-Threshold
IgnoreHeader X-NEWT-spamscore
IgnoreHeader X-NNTP-Posting-Date
IgnoreHeader X-NNTP-Posting-Host
IgnoreHeader X-NetcoreISpam1-ECMScanner
IgnoreHeader X-NetcoreISpam1-ECMScanner-From
IgnoreHeader X-NetcoreISpam1-ECMScanner-Information
IgnoreHeader X-NetcoreISpam1-ECMScanner-SpamCheck
IgnoreHeader X-NetcoreISpam1-ECMScanner-SpamScore
IgnoreHeader X-Newsreader
IgnoreHeader X-Newsserver
IgnoreHeader X-No-Archive
IgnoreHeader X-No-Spam
IgnoreHeader X-OSBF-Lua-Score
IgnoreHeader X-OWM-SpamCheck
IgnoreHeader X-OWM-VirusCheck
IgnoreHeader X-Olypen-Virus
IgnoreHeader X-Orig-Path
IgnoreHeader X-OriginalArrivalTime
IgnoreHeader X-Originating-IP
IgnoreHeader X-PAA-AntiVirus
IgnoreHeader X-PAA-AntiVirus-Message
IgnoreHeader X-PGP-Fingerprint
IgnoreHeader X-PGP-Hash
IgnoreHeader X-PGP-ID
IgnoreHeader X-PGP-Key
IgnoreHeader X-PGP-Key-Fingerprint
IgnoreHeader X-PGP-KeyID
IgnoreHeader X-PGP-Sig
IgnoreHeader X-PIRONET-NDH-MailScanner-SpamCheck
IgnoreHeader X-PIRONET-NDH-MailScanner-SpamScore
IgnoreHeader X-PMX
IgnoreHeader X-PMX-Version
IgnoreHeader X-PN-SPAMFiltered
IgnoreHeader X-Posting-Agent
IgnoreHeader X-Posting-ID
IgnoreHeader X-Posting-IP
IgnoreHeader X-Priority
IgnoreHeader X-Proofpoint-Spam-Details
IgnoreHeader X-Qmail-Scanner-1.25st
IgnoreHeader X-Quarantine-ID
IgnoreHeader X-RAV-AntiVirus
IgnoreHeader X-RITmySpam
IgnoreHeader X-RITmySpam-IP
IgnoreHeader X-RITmySpam-Spam
IgnoreHeader X-Rc-Spam
IgnoreHeader X-Rc-Virus
IgnoreHeader X-Received-Date
IgnoreHeader X-RedHat-Spam-Score
IgnoreHeader X-RedHat-Spam-Warning
IgnoreHeader X-RegEx
IgnoreHeader X-RegEx-Score
IgnoreHeader X-Rocket-Spam
IgnoreHeader X-SA-GROUP
IgnoreHeader X-SA-RECEIPTSTATUS
IgnoreHeader X-STA-NotSpam
IgnoreHeader X-STA-Spam
IgnoreHeader X-Scam-grey
IgnoreHeader X-Scanned-By
IgnoreHeader X-Sender
IgnoreHeader X-SenderID
IgnoreHeader X-Sohu-Antivirus
IgnoreHeader X-Spam
IgnoreHeader X-Spam-ASN
IgnoreHeader X-Spam-Check
IgnoreHeader X-Spam-Checked-By
IgnoreHeader X-Spam-Checker
IgnoreHeader X-Spam-Checker-Version
IgnoreHeader X-Spam-Clean
IgnoreHeader X-Spam-DCC
IgnoreHeader X-Spam-Details
IgnoreHeader X-Spam-Filter
IgnoreHeader X-Spam-Filtered
IgnoreHeader X-Spam-Flag
IgnoreHeader X-Spam-Level
IgnoreHeader X-Spam-OrigSender
IgnoreHeader X-Spam-Pct
IgnoreHeader X-Spam-Prev-Subject
IgnoreHeader X-Spam-Processed
IgnoreHeader X-Spam-Pyzor
IgnoreHeader X-Spam-Rating
IgnoreHeader X-Spam-Report
IgnoreHeader X-Spam-Scanned
IgnoreHeader X-Spam-Score
IgnoreHeader X-Spam-Status
IgnoreHeader X-Spam-Tagged
IgnoreHeader X-Spam-Tests
IgnoreHeader X-Spam-Tests-Failed
IgnoreHeader X-Spam-Virus
IgnoreHeader X-Spam-Warning
IgnoreHeader X-Spam-detection-level
IgnoreHeader X-SpamAssassin-Clean
IgnoreHeader X-SpamAssassin-Warning
IgnoreHeader X-SpamBouncer
IgnoreHeader X-SpamCatcher-Score
IgnoreHeader X-SpamCop-Checked
IgnoreHeader X-SpamCop-Disposition
IgnoreHeader X-SpamCop-Whitelisted
IgnoreHeader X-SpamDetected
IgnoreHeader X-SpamInfo
IgnoreHeader X-SpamPal
IgnoreHeader X-SpamPal-Timeout
IgnoreHeader X-SpamReason
IgnoreHeader X-SpamScore
IgnoreHeader X-SpamTest-Categories
IgnoreHeader X-SpamTest-Info
IgnoreHeader X-SpamTest-Method
IgnoreHeader X-SpamTest-Status
IgnoreHeader X-SpamTest-Version
IgnoreHeader X-Spamadvice
IgnoreHeader X-Spamarrest-noauth
IgnoreHeader X-Spamarrest-speedcode
IgnoreHeader X-Spambayes-Classification
IgnoreHeader X-Spamcount
IgnoreHeader X-Spamsensitivity
IgnoreHeader X-TERRACE-SPAMMARK
IgnoreHeader X-TERRACE-SPAMRATE
IgnoreHeader X-TM-AS-Category-Info
IgnoreHeader X-TM-AS-MatchedID
IgnoreHeader X-TM-AS-Product-Ver
IgnoreHeader X-TM-AS-Result
IgnoreHeader X-TMWD-Spam-Summary
IgnoreHeader X-TNEFEvaluated
IgnoreHeader X-Text-Classification
IgnoreHeader X-Text-Classification-Data
IgnoreHeader X-Trace
IgnoreHeader X-UCD-Spam-Score
IgnoreHeader X-User-Agent
IgnoreHeader X-User-ID
IgnoreHeader X-User-System
IgnoreHeader X-Virus-Check
IgnoreHeader X-Virus-Checked
IgnoreHeader X-Virus-Checker-Version
IgnoreHeader X-Virus-Scan
IgnoreHeader X-Virus-Scanned
IgnoreHeader X-Virus-Scanner
IgnoreHeader X-Virus-Scanner-Result
IgnoreHeader X-Virus-Status
IgnoreHeader X-VirusChecked
IgnoreHeader X-Virusscan
IgnoreHeader X-WSS-ID
IgnoreHeader X-WinProxy-AntiVirus
IgnoreHeader X-WinProxy-AntiVirus-Message
IgnoreHeader X-Yandex-Forward
IgnoreHeader X-Yandex-Front
IgnoreHeader X-Yandex-Spam
IgnoreHeader X-Yandex-TimeMark
IgnoreHeader X-cid
IgnoreHeader X-iHateSpam-Checked
IgnoreHeader X-iHateSpam-Quarantined
IgnoreHeader X-policyd-weight
IgnoreHeader X-purgate
IgnoreHeader X-purgate-Ad
IgnoreHeader X-purgate-ID
IgnoreHeader X-sgxh1
IgnoreHeader X-to-viruscore
IgnoreHeader Xref
IgnoreHeader acceptlanguage
IgnoreHeader thread-index
IgnoreHeader x-uscspam
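
To illustrate the effect of such a list, here is a small sketch of which
headers remain visible once the listed ones are skipped. It is not DSPAM's
implementation: the function names, the sample configuration and the sample
message are made up, and the case-insensitive comparison is an assumption
of the sketch, not a statement about how DSPAM matches header names.

```python
from email import message_from_string

# Made-up excerpt of a dspam.conf and a made-up message, for illustration.
CONF = """\
IgnoreHeader Received
IgnoreHeader X-Spam-Status
IgnoreHeader Message-ID
"""

RAW = """\
Received: from mx.example.org by mail.example.org
X-Spam-Status: No, score=-1.2
Message-ID: <1234@example.org>
Subject: Cheap watches
From: someone@example.org

Buy now.
"""


def ignored_headers(conf_text):
    """Collect the header names from 'IgnoreHeader <name>' lines."""
    names = set()
    for line in conf_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "IgnoreHeader":
            # Lowercased for comparison; this is an assumption of the sketch.
            names.add(parts[1].lower())
    return names


def visible_headers(raw_message, ignored):
    """Return the (name, value) pairs that are not on the ignore list."""
    msg = message_from_string(raw_message)
    return [(k, v) for k, v in msg.items() if k.lower() not in ignored]


print(visible_headers(RAW, ignored_headers(CONF)))
```

With the three sample entries, only Subject and From are left for a
tokenizer to look at; headers written by other filters or by the transport
no longer contribute tokens.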


> PS: I think this training issue is a big problem for newcomers. We need
> a good document about the training.
>
It's hard to explain how to train well without explaining a bunch of
mathematical concepts. Most users simply think that more training equals
better results, and this is definitely not true. Clever training is the key.
My little application was born exactly because of this. I needed a way to
train my DSPAM instance automatically and without supervision, from a Spam
honeypot and from normal outbound mail, without overtraining DSPAM.
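
As a rough illustration of why "train on everything" and "train only on
errors" (TEFT and TOE in dspam.conf terms) produce very different amounts of
data, here is a toy sketch. The scorer is a deliberately naive stand-in, not
DSPAM's algorithm and not the training application mentioned above, and the
six-message corpus is invented. It only shows how the number of stored
tokens differs, not how accuracy behaves on real mail.

```python
from collections import Counter


def classify(tokens, spam_counts, ham_counts):
    """Toy scorer: 'spam' if the tokens were seen more often in spam."""
    score = sum(spam_counts[t] - ham_counts[t] for t in tokens)
    return "spam" if score > 0 else "ham"


def train(corpus, train_on_error):
    """Return (messages learned, distinct tokens stored)."""
    spam_counts, ham_counts = Counter(), Counter()
    learned = 0
    for label, text in corpus:
        tokens = text.split()
        if train_on_error and classify(tokens, spam_counts, ham_counts) == label:
            continue  # already classified correctly: store nothing
        target = spam_counts if label == "spam" else ham_counts
        target.update(tokens)
        learned += 1
    return learned, len(set(spam_counts) | set(ham_counts))


# Invented corpus for illustration only.
corpus = [
    ("spam", "cheap pills online"),
    ("ham", "meeting notes attached"),
    ("spam", "cheap pills today"),
    ("ham", "meeting agenda attached"),
    ("spam", "cheap watches online"),
    ("ham", "lunch meeting tomorrow"),
]

print("train everything:", train(corpus, train_on_error=False))
print("train on error:  ", train(corpus, train_on_error=True))
```

On this corpus, training on everything stores 11 distinct tokens from 6
messages, while training on errors stores 3 tokens from a single message and
still classifies the rest correctly. A real corpus is far less tidy, so the
numbers say nothing about catch rate.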


> If I learn it very well, I am planning to write a document.
> Thanks.
> 
> 
> On Tue, Mar 29, 2011 at 10:35 PM, Stevan Bajić <ste...@bajic.ch> wrote:
> > On Tue, 29 Mar 2011 17:24:28 +0300
> > Ibrahim Harrani <ibrahim.harr...@gmail.com> wrote:
> >
> >> Hi Kenneth,
> >>
> > Hello Ibrahim,
> >
> >
> >> Thanks for your prompt reply.
> >> Yes, this is from a single user. But I am planning to use this user as a
> >> global user that will be managed by admins.
> >> I trained all spam with the same --username.
> >> I changed fillfactor to 90 after the training, not at the beginning,
> >> but this did not solve the problem.
> >>
> >> Algorithm graham burton
> >> Tokenizer chain
> >>
> >> What do you suggest for the number of training ham/spam mails?
> >> Are 2K mails enough? I trained dspam with the TEFT option. After the
> >> training I switched to TOE in dspam.conf.
> >> I would like to reduce the database size (currently 600MB) without
> >> losing spam catch rate.
> >>
> > I don't know how open you are to suggestions. If you trust me then I would
> > like to get hold of the data you used for the training. If you can compress
> > the Spam/Ham and make it available for download, then I would like to offer
> > to do the training for you. I would do the training with an application I
> > developed myself, which trains differently than the stock DSPAM training
> > application. The end result can be consumed by stock DSPAM. So after the
> > whole training I would just export the data from PostgreSQL, compress it
> > and make it available to you.
> >
> > I am confident that the different training method will result in much less
> > data than the stock DSPAM training method while having at least an equal
> > catch rate (in my experience the catch rate will be better).
> >
> > Unfortunately I cannot release that training application because I have
> > made some changes to stock DSPAM and the training application uses new
> > functionality not available in stock DSPAM.
> >
> > Anyway... if you are open-minded then let me know where I can download the
> > training data and I will do the training. I promise that I will NOT use
> > the data for anything other than the training. I don't think that the Spam
> > part is sensitive but the Ham part sure is. But you have my word that I
> > will not reuse or redistribute that data.
> >
> >
> >>
> >> Here is the debug log. As you see there is a 22 second delay between
> >> the "pgsql query..." line and the BNR pattern line.
> >> It seems dspam spends that time in the database query.
> >>
> > Crazy. The query is just around 11K. That's nothing. And you run that on a 
> > 4GB system? This should be enough. DSPAM is not that memory hungry.
> >
> >
> >> Tue Mar 29 15:15:25 2011  1112: [03/29/2011 15:15:25] Processing body
> >> token 'visit'
> >> Tue Mar 29 15:15:25 2011  1112: [03/29/2011 15:15:25] Finished
> >> tokenizing (ngram) message
> >> Tue Mar 29 15:15:25 2011  1112: [03/29/2011 15:15:25] pgsql query length: 
> >> 11051
> >> Tue Mar 29 15:15:25 2011
> >> Tue Mar 29 15:15:47 2011  1112: [03/29/2011 15:15:47] BNR pattern
> >> instantiated: 'bnr.s|0.00_0.00_0.05_'
> >> Tue Mar 29 15:15:47 2011  1112: [03/29/2011 15:15:47] BNR pattern
> >> instantiated: 'bnr.s|0.00_0.05_0.30_'
> >> Tue Mar 29 15:15:47 2011  1112: [03/29/2011 15:15:47] BNR pattern
> >> instantiated: 'bnr.s|0.05_0.30_0.10_'
> >>
> >>
> >> Tue Mar 29 15:23:32 2011  1112: [03/29/2011 15:23:32] Finished
> >> tokenizing (ngram) message
> >> Tue Mar 29 15:23:32 2011  1112: [03/29/2011 15:23:32] pgsql query length: 
> >> 11023
> >> Tue Mar 29 15:23:32 2011
> >> Tue Mar 29 15:23:41 2011  1112: [03/29/2011 15:23:41] BNR pattern
> >> instantiated: 'bnr.s|0.00_0.00_0.05_'
> >> Tue Mar 29 15:23:41 2011  1112: [03/29/2011 15:23:41] BNR pattern
> >> instantiated: 'bnr.s|0.00_0.05_0.30_'
> >>
> >>
> >>
> >> Tue Mar 29 15:35:08 2011  1112: [03/29/2011 15:35:08] Processing body
> >> token 'org"'
> >> Tue Mar 29 15:35:08 2011  1112: [03/29/2011 15:35:08] Finished
> >> tokenizing (ngram) message
> >> Tue Mar 29 15:35:08 2011  1112: [03/29/2011 15:35:08] pgsql query length: 
> >> 28271
> >> Tue Mar 29 15:35:08 2011
> >> Tue Mar 29 15:35:48 2011  1112: [03/29/2011 15:35:48] BNR pattern
> >> instantiated: 'bnr.s|0.00_0.00_0.50_'
> >> Tue Mar 29 15:35:48 2011  1112: [03/29/2011 15:35:48] BNR pattern
> >> instantiated: 'bnr.s|0.00_0.50_0.10_'
> >> Tue Mar 29 15:35:48 2011  1112: [03/29/2011 15:35:48] BNR pattern
> >> instantiated: 'bnr.s|0.50_0.10_0.15_'
> >>
> > Really strange. 40 seconds between query and BNR? This is way too much time.
> >
> > If you trust me regarding the Ham data then I would be very much
> > interested to see how low I can go with the space usage and still maintain
> > a high accuracy. After all, you don't have anything to lose. You could
> > save your current data and then switch inside dspam.conf from one database
> > instance to the other and see which one has better accuracy, or keep your
> > current dspam.conf, swap it with the one I would provide for use with
> > the dataset I produced, and then compare the results.
> >
> > Are you open to such a small experiment? Just let me know.
> >
> >
> >
> >> Thanks.
> >>
> > --
> > Kind Regards from Switzerland,
> >
> > Stevan Bajić
> >
> >
> >>
> >> On Tue, Mar 29, 2011 at 4:28 PM, Kenneth Marshall <k...@rice.edu> wrote:
> >> > On Tue, Mar 29, 2011 at 11:45:39AM +0300, Ibrahim Harrani wrote:
> >> >> Hi,
> >> >>
> >> >> I am testing git version of dspam with PostgreSQL 9.0 running on
> >> >> FreeBSD 8 (Dual core cpu, 4 GB memory)
> >> >>
> >> >> I trained dspam with 110K spam and 50K ham mails. Now I have more than
> >> >> 7 million entries in dspam.
> >> >>
> >> >> dspam=# SELECT count(*) from dspam_token_data ;
> >> >>   count
> >> >> ---------
> >> >>  7075311
> >> >> (1 row)
> >> >>
> >> >> I vacuum and reindex database regularly.
> >> >>
> >> >> When I start dspam, processing an email takes 40-50 sec at the
> >> >> beginning, then drops to 10 sec.
> >> >> If I run this test on a more powerful server (quad core CPU with 16GB
> >> >> memory), it takes 0.01 secs.
> >> >> I believe that the problem on the small server is the large number of
> >> >> database entries, but I would like to get better performance
> >> >> on the small server as well. Any ideas?
> >> >>
> >> >> Do you think that sqlite might be better than pgsql on this setup? Or
> >> >> did I train dspam with too much spam/ham?
> >> >>
> >> >> Thanks.
> >> >>
> >> >
> >> > Hi Ibrahim,
> >> >
> >> > Are these 7 million tokens for a single user? What tokenizer are you
> >> > using: WORD, CHAIN, MARKOV/OSB, MARKOV/SBPH? That seems like an awful
> >> > lot of training. The docs usually recommend 2k messages each of ham
> >> > and spam. When we generated a base corpus for our user community,
> >> > we pruned the resulting millions of tokens down to about 300k. Another
> >> > thing that can help is to cluster your data on the uid+token index.
> >> > It looks like you cannot keep the full active token pages in memory
> >> > with only a 4GB system. Look at your paging/swapping stats. You may
> >> > be able to reduce your memory footprint which should help your 
> >> > performance.
> >> > Do you have your FILL FACTOR set to allow HOT updates?
> >> >
> >> > Cheers,
> >> > Ken
> >> >
> >>
> >> _______________________________________________
> >> Dspam-user mailing list
> >> Dspam-user@lists.sourceforge.net
> >> https://lists.sourceforge.net/lists/listinfo/dspam-user
> >>
> >
> >
> 
