I don't believe that I'm saying this but - it's clear! Thank you.

On Wed, Sep 16, 2009 at 1:40 PM, Thomas Eckardt/eck <thomas.ecka...@thockar.com> wrote:
> >Does this mean that I can delete all of the files that were used to
> >build the current spamdb?
>
> Yes.
>
> >only one message comes in to the server and it's spam. This will be
> >added to the spamdb right?
>
> No - too few files - but if more files exist - yes.
>
> >If so, let's just say for simplicity, that the next day, only one
> >message comes in to the server and it's spam. This will be added to
> >the spamdb right?
>
> No - the records from this file already exist. / Yes, if the new
> constellation of all folders results in a new weight value for existing
> word pairs - we must have the chance to correct records.
>
> >The next day, no messages come in. What happens for the rebuild? Does
> >the single file in the spam folder get added again?
>
> No / Yes - same reason.
>
> >With adding to the spamdb, does this mean that the db will still have
> >the initial index of this file as spam, but will counter it with a
> >higher weight from the corrected folder?
>
> Yes - or you correct it in reverse. However, if the values for word
> pairs differ from those in the spamdb, the new values will be used.
>
> Thomas
>
> K Post <nntp.p...@gmail.com>
> 16.09.2009 19:25
> Please reply to: ASSP development mailing list <assp-test@lists.sourceforge.net>
> To: ASSP development mailing list <assp-test@lists.sourceforge.net>
> Cc:
> Subject: Re: [Assp-test] Antwort: Re: Antwort: Re: Antwort: Re: Antwort: Re: Antwort: Re: fixes and news in 2.0.1_RC0.4.12
>
> Wow - first, Thomas, thanks for taking the time yet again to discuss this.
>
> Your explanation this time finally got through to me! I get it now, at
> least most of it...
>
> I must have misunderstood one explanation or another and mistakenly
> thought that assp doesn't pay attention to duplicates. Thanks for
> pointing that out.
>
> If MaintBayesCollection is already deleting extra files beyond maxfiles,
> why do we need or even want MaxBayesFileAge and MaxCorrectedDays?
>
> I'll raise our maxfiles limit, which will make the rebuild process take
> longer (which is fine). Then I won't care about duplicate subject names
> any more and I'll let the rebuild process work its magic.
>
> If deleting randomly is only bad because it will break block reporting,
> I respectfully disagree. We could put in a minimum TTL like 30 days.
> Anyone who is requesting an email older than that >>might<< be out of
> luck if it was randomly deleted.
>
> I could use some more explanation on adding to the spamdb each time
> instead of replacing it, if you don't mind:
>
> Let's start simple, assuming that I've got the perfect spamdb already
> and I've turned on the option to add to the db instead of replacing it.
> Correct me where I (am surely) wrong. Does this mean that I can delete
> all of the files that were used to build the current spamdb?
>
> If so, let's just say for simplicity that the next day only one message
> comes in to the server and it's spam. This will be added to the spamdb,
> right?
>
> The next day, no messages come in. What happens for the rebuild? Does
> the single file in the spam folder get added again?
>
> Then the third day, again no messages come in, but we realize that the
> one spam message actually isn't spam. Previously, I'd move this file to
> the error/spam folder. With adding to the spamdb, does this mean that
> the db will still have the initial index of this file as spam, but will
> counter it with a higher weight from the corrected folder?
>
> Thanks again Thomas. If ASSP didn't work so well already, I wouldn't
> have all this time to have this discussion because I'd be sorting
> through my spam.
> :)
>
> On Wed, Sep 16, 2009 at 5:15 AM, Thomas Eckardt/eck <thomas.ecka...@thockar.com> wrote:
>
> > >If we do this, we can remove what's probably the same message
> >
> > rebuildspamdb parses the content of all files (except those marked for
> > deletion) and computes an MD5 hash over the MIME-decoded and cleaned-up
> > body - so we know exactly (not probably!!) if the same message content
> > was processed before. If so, we ignore all other equal messages. If the
> > content is not equal but similar, processing such files makes the
> > spamdb more perfect - which is the target!
> >
> > >since duplicates aren't considered by assp
> >
> > This is not and was never the case!!! Where did you get this
> > 'knowledge' from?
> >
> > >and then delete randomly
> >
> > If we do anything randomly, we lose control - I prefer using an
> > intelligent way, even if it takes more time.
> >
> > But you are right - it could be possible that many messages with the
> > same subject are collected. This is because assp (V2) adds a unique
> > number (counter 1 - 999.999) to the filename (since RC0.2.xx) if a file
> > is collected in the discarded folder or 'UseSubjectsAsMaillogNames' is
> > selected. (In the early versions this was not a problem: the next
> > message with the same subject overwrote the file of the last one.)
> > This is done for the following reasons: it could be possible that more
> > than one SMTP worker receives a message with the same subject at the
> > same time and all want to open/write the same collection file at the
> > same time - which could cause stuck workers or unexpected restarts -
> > and if a file was overwritten and a user requested a resend of a
> > blocked mail, it was possible that he got the wrong mail.
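The duplicate check Thomas describes earlier in this message (an MD5 hash over the MIME-decoded, cleaned-up body, with the subject ignored) can be sketched roughly as follows. This is not rebuildspamdb's code: ASSP is written in Perl and the sketch is Python, the cleanup step (collapsing whitespace and lower-casing) is an assumption, and the names `cleaned_body`, `body_digest` and `unique_messages` are hypothetical.

```python
import hashlib
import re
from email import message_from_bytes


def cleaned_body(raw: bytes) -> str:
    """Return the MIME-decoded text parts of a message, normalised.

    Assumption: "cleaned up" is modelled here as collapsing whitespace
    and lower-casing; the real cleanup rules are not given in the thread.
    """
    msg = message_from_bytes(raw)
    texts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and attachments
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            texts.append(payload.decode(charset, errors="replace"))
        except LookupError:  # unknown charset label in the message
            texts.append(payload.decode("latin-1"))
    return re.sub(r"\s+", " ", " ".join(texts)).strip().lower()


def body_digest(raw: bytes) -> str:
    """MD5 hex digest of the cleaned body; headers such as Subject are ignored."""
    return hashlib.md5(cleaned_body(raw).encode("utf-8")).hexdigest()


def unique_messages(raw_messages):
    """Yield only the first message seen for each body digest."""
    seen = set()
    for raw in raw_messages:
        digest = body_digest(raw)
        if digest in seen:
            continue  # same body content was processed before
        seen.add(digest)
        yield raw
```

Two mails with different subjects but the same body collapse to one entry here, which is why trimming the corpus by subject name would not add anything to the rebuild.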
> >
> > There is a good reason to delete the oldest files first - because (as
> > I've told you before) random deletion of files (completely ignoring the
> > file age) will possibly break the BlockReport resend function.
> >
> > But the discussion misses the concept of rebuildspamdb in V2.
> >
> > The concept of rebuildspamdb in V2 - I hope it is clear enough:
> >
> > What should you do?
> > 1)
> > Collect a good corpus - ignoring the file age (MaintBayesCollection ==
> > 0) - and maintain the correction folders (do/try your very best!!!).
> > This will result in a good spamdb. You may also use a good spamdb from
> > your friend.
> > 2)
> > From this time, there is no need to have old files in the corpus,
> > because we have the result of those files in our spamdb. Now set
> > ReplaceOldSpamdb to 0, and in future only new and corrected word pairs
> > (we have to pay attention to the correction folders) are written into
> > the spamdb. To keep the number of files in the corpus to your needs,
> > set MaintBayesCollection, MaxBayesFileAge, MaxCorrectedDays,
> > MaxNoBayesFileAge and MaxFileAgeSchedule to your best values. Keep an
> > eye on the correction folders to prevent bad corrections!
> >
> > More details!:
> > As you can see, the concept has changed in a major way - we use not
> > only the corpus - we use the existing spamdb and the corpus, which is
> > much more exact. Large parts of the long-term memory of our corpus are
> > moved into the spamdb. Or better, our long-term memory has extremely
> > increased. So even if the corpus is getting bad or corrupt because of
> > wrong collection/correction, new spammer behavior or any other
> > coincidence - our spamdb will be in a consistent, current state. We
> > leave nothing to chance, as we would by doing anything randomly!
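A minimal sketch of the "add instead of replace" idea (ReplaceOldSpamdb set to 0): weights computed from the current corpus are merged into the existing spamdb, pairs whose value changed are overwritten, and pairs that no longer occur in any corpus file are kept. This is Python rather than ASSP's Perl, and the weight formula (the spam share of a word pair's occurrences) and the names `word_pairs`, `pair_weights` and `merge_into_spamdb` are assumptions for illustration only; the thread does not show how rebuildspamdb actually computes its values.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def word_pairs(text: str) -> Iterable[str]:
    """Yield adjacent word pairs such as 'cheap pills'."""
    words = text.lower().split()
    return (f"{a} {b}" for a, b in zip(words, words[1:]))


def pair_weights(spam_texts, ham_texts) -> Dict[str, float]:
    """Assumed weight: fraction of a pair's occurrences that were spam."""
    spam = Counter(p for t in spam_texts for p in word_pairs(t))
    ham = Counter(p for t in ham_texts for p in word_pairs(t))
    return {p: spam[p] / (spam[p] + ham[p]) for p in spam.keys() | ham.keys()}


def merge_into_spamdb(
    spamdb: Dict[str, float], fresh: Dict[str, float]
) -> Tuple[Dict[str, float], int]:
    """Merge fresh weights into the existing spamdb.

    New pairs are added, pairs whose value differs are overwritten with
    the new value, and pairs missing from `fresh` are left untouched.
    Returns the merged db and the number of records written.
    """
    merged = dict(spamdb)
    written = 0
    for pair, weight in fresh.items():
        if merged.get(pair) != weight:
            merged[pair] = weight
            written += 1
    return merged, written
```

Under these assumptions the old corpus files can be deleted after a rebuild, because their contribution already lives in `spamdb`, and a later correction changes the stored value only for the word pairs it touches.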
> >
> > If someone wants to use the old concept (build a completely new spamdb
> > depending on the currently existing, more or less randomly built
> > corpus) - this is possible by leaving ReplaceOldSpamdb at the default
> > value (1) and deselecting MaintBayesCollection. But in this case the
> > old collection concept should also be used (deselect
> > UseSubjectsAsMaillogNames or use doMove2Num), and you have to accept
> > that some of the new features will not work as expected (for example:
> > BlockReports - resend).
> >
> > Maybe the config description is not clear enough to understand the
> > concepts - but a description made by a developer is never the best.
> >
> > hope this helps
> >
> > Thomas
> >
> > K Post <nntp.p...@gmail.com>
> > 15.09.2009 23:55
> > Please reply to: ASSP development mailing list <assp-test@lists.sourceforge.net>
> > To: ASSP development mailing list <assp-test@lists.sourceforge.net>
> > Cc:
> > Subject: Re: [Assp-test] Antwort: Re: Antwort: Re: Antwort: Re: Antwort: Re: fixes and news in 2.0.1_RC0.4.12
> >
> > I'll try to simplify my discussion a bit
> >
> > 1) It's my understanding that currently files are only deleted by date,
> > and only with subject logging on and move2numb off. Yes? I want to see
> > random deletion in 0.4.14
> >
> > 2) We agree that deletion by date isn't the best for bayesian filtering,
> > yes? If so, then I want to keep the number of files closer to maxfile
> > by first removing what is probably a duplicate email. Easiest way to do
> > this that I've thought of: delete based on subject names. If we do
> > this, we can remove what's probably the same message, and then delete
> > randomly to get down to the maxfiles number of files. That'll leave
> > more unique messages, which is important since duplicates aren't
> > considered by assp.
> >
> > 3) I'm confused by the MaintBayesCollection option.
> > I use bayesian, I do NOT want the folders to have files removed
> > automatically, oldest first, to get to maxfiles. I want to do it by
> > subject trimming first, then randomly. My point previously is that the
> > description in admin for MaintBayesCollection suggests that files will
> > be deleted by date. This doesn't have anything to do with
> > MaxNoBayesFileAge, etc., does it? The max file age options say things
> > like "A value of 0 disables this feature and no file will be deleted
> > because of its age", but does this override the processing that the
> > admin server says will happen if maintbayescollection is checked?
> > (deleting based on age to get to maxfiles)
> >
> > 4) You don't have the min option in ASSP now, do you? I think that
> > Brett and I are basically saying the same thing here. I like the TTL
> > language, though min would be more consistent IMO.
> >
> > On Tue, Sep 15, 2009 at 1:31 PM, Thomas Eckardt/eck <thomas.ecka...@thockar.com> wrote:
> >
> > > I do not understand the discussion!
> > >
> > > All of these wishes are already built into assp, except removing mails
> > > with the same subject - I do not love this idea, because the subject
> > > is ignored by rebuildspamdb - only the body is used, and mails with
> > > the same body are ignored (except one) and will be deleted 60 days
> > > later.
> > >
> > > -------------------------------------------
> > > ['MaintBayesCollection','Maintenance for Bayesian
> > > Collection',0,\&checkbox,'','(.*)',undef,
> > > 'Set this to on, if you want ASSP to run a maintenance tasks on the
> > > bayesian collection folders ( spamlog , notspamlog , correctedspam ,
> > > correctednotspam ). ASSP will delete the oldest files until the number
> > > of files per folder reaches MaxFiles. If you want ASSP to delete files
> > > because of their age instead of the number of files ( MaxFiles ), setup
> > > MaxBayesFileAge and/or MaxCorrectedDays to your needs.<br />
> > > This option is usefull, if UseSubjectsAsMaillogNames is set to on and
> > > doMove2Num is set to off, because in this case the number of files in
> > > every collection folder will grow
> > > infinite.',undef,undef,'msg006140','msg006141'],
> > >
> > > ['MaxBayesFileAge','Max Age of Bayes
> > > Files',10,\&textinput,0,'(\d+)',undef,
> > > 'The maximum file age in days of every file in every bayesian
> > > collection folder ( spamlog , notspamlog ). If MaintBayesCollection is
> > > set to on and a file is older than this number in days, the file will
> > > be deleted. Default is 0. A value of 0 disables this feature and no
> > > file will be deleted because of its age.<br />
> > > <span class = "negative">Do not define this option, if you use the
> > > bayesian engine of ASSP. Deleting files because of there age, is wrong
> > > in this case!!!!!</span>',undef,undef,'msg006150','msg006151'],
> > >
> > > ['MaxCorrectedDays','Max Corrected File
> > > Age',5,\&textinput,'1000','(\d+)',undef,'This is the number of days a
> > > error report will be kept in the correctednotspam and correctedspam
> > > folders. These folders are the longterm memory of ASSP, therefore the
> > > default is 1000 days. ',undef,undef,'msg008590','msg008591'],
> > >
> > > ['MaxNoBayesFileAge','Max Age of non Bayes
> > > Files',10,\&textinput,0,'(\d+)',undef,
> > > 'The maximum file age in days of every file in every non bayesian
> > > collection folder ( incomingOkMail , discarded , viruslog ). If defined
> > > and a file is older than this number in days, the file will be deleted.
> > > Default is 0. A value of 0 disables this feature and no file will be
> > > deleted because of its age.',undef,undef,'msg006160','msg006161'],
> > > ---------------------------------------------
> > >
> > > If MaintBayesCollection is set to on, it is your choice to set the
> > > rest to your needs.
> > >
> > > - MaxBayesFileAge/MaxNoBayesFileAge == 0 - reduce the number of files
> > >   to maxfiles by deleting the oldest
> > > - MaxBayesFileAge/MaxNoBayesFileAge != 0 - reduce the number of files
> > >   by deleting all that are older than XX
> > > - MaxCorrectedDays - these files should never be deleted (use 1000000)
> > >
> > > And keep in mind - if the number of files per folder is reduced to
> > > maxfiles at 1:00 AM and rebuildspamdb is running at 11:00 PM -
> > > rebuildspamdb possibly has to process many more than maxfiles!
> > >
> > > Currently there is a mistake in this maint-task: the files with the
> > > filedate set to 60 days in the future are the last files that will be
> > > deleted - this will be fixed in 4.14
> > >
> > > Thomas
> > >
> > > "GrayHat" <gray...@gmx.net>
> > > 15.09.2009 18:35
> > > Please reply to: GrayHat <gray...@gmx.net>; ASSP development mailing list <assp-test@lists.sourceforge.net>
> > > To: "ASSP development mailing list" <assp-test@lists.sourceforge.net>
> > > Cc:
> > > Subject: Re: [Assp-test] Antwort: Re: Antwort: Re: Antwort: Re: fixes and news in 2.0.1_RC0.4.12
> > >
> > > >> Hmm... that sounds like an idea which was brought up some
> > > >> time ago (John was still the dev for ASSP at the time); that
> > > >> is, set up some kind of TTL parameter for corpus files so
> > > >> that the spamdb rebuild should check the file date/time and
> > > >> if over the TTL (say "n" days) it should then delete the file.
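The two maintenance modes Thomas lists above (age setting equal to 0: trim each folder to maxfiles, oldest first; age setting not equal to 0: delete everything older than that many days) can be sketched as a single selection function. This is a hedged Python illustration, not ASSP's Perl maintenance task: the function name `files_to_delete`, its parameters and the `(name, mtime)` tuples are hypothetical, and the special handling of files dated 60 days in the future is deliberately left out.

```python
from typing import List, Tuple

SECONDS_PER_DAY = 86400


def files_to_delete(
    files: List[Tuple[str, float]],  # (file name, modification time in seconds)
    max_files: int,
    max_age_days: int,
    now: float,
) -> List[str]:
    """Pick the files one maintenance run would remove from a folder."""
    if max_age_days > 0:
        # Age mode: delete everything older than the configured number of days.
        cutoff = now - max_age_days * SECONDS_PER_DAY
        return [name for name, mtime in files if mtime < cutoff]
    # Count mode: delete the oldest files until max_files remain.
    oldest_first = sorted(files, key=lambda item: item[1])
    excess = max(0, len(files) - max_files)
    return [name for name, _ in oldest_first[:excess]]
```

In a real deployment the modification times would come from the filesystem (for example via `os.stat`), and the result would be passed to whatever performs the actual deletion.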
> > >
> > > > My thought is that the "TTL" would only be in effect for the purpose
> > > > of keeping BlockReporting working (for however many days or
> > > > weeks you wish the emails to be guaranteed resendable).
> > > > After that time, the TTL is null and the files are game for
> > > > replacement. I thought it a simple idea for working around
> > > > the BlockReporting problem Thomas mentioned.
> > >
> > > I see, but there's no need to store something along with files,
> > > the regular filesystem timestamp for each file will just work
> > > fine, just remove all files if "(today - filetime) > TTL"
> > >
> > > > On a low-to-medium traffic box, though, this would not be a
> > > > problem. We already deal with bunches of identical
> > > > messages from time-to-time (nothing new).
> > >
> > > there may be a solution for that too, assuming the spam and
> > > notspam folders get cleaned up using the TTL, the files may
> > > be saved using (e.g.) an MD5 hash (or the like) as the name
> > > so that identical messages won't be stored more than one
> > > time; by the way that may have some side effects and may
> > > need some more thinking but...
> > >
> > > >> Bottom line; the bayes filter should work by /learning/ this
> > > >> means that it should NOT discard the previous data, but
> > > >> rather REFINE them from further data coming in; so maybe the
> > > >> whole bayes approach used inside ASSP should be revised NOT
> > > >> to deal just with the latest data but to learn/improve during time
> > >
> > > > Just an idea, but how do you "NOT" discard data while keeping
> > > > rebuild times low and maintaining free hard drive space
> > > > (realistically)?
> > >
> > > Using some kind of "digest" of the previous bases stored in a
> > > more compact format
> > >
> > > ------------------------------------------------------------------------------
> > > Come build with us!
> > > The BlackBerry® Developer Conference in SF, CA
> > > is the only developer event you need to attend this year. Jumpstart your
> > > developing skills, take BlackBerry mobile applications to market and stay
> > > ahead of the curve. Join us from November 9-12, 2009. Register now!
> > > http://p.sf.net/sfu/devconf
> > > _______________________________________________
> > > Assp-test mailing list
> > > Assp-test@lists.sourceforge.net
> > > https://lists.sourceforge.net/lists/listinfo/assp-test
> > >
> > > DISCLAIMER:
> > > *******************************************************
> > > This email and any files transmitted with it may be confidential, legally
> > > privileged and protected in law and are intended solely for the use of the
> > > individual to whom it is addressed.
> > > This email was multiple times scanned for viruses. There should be no
> > > known virus in this email!
> > > *******************************************************