I'm sorry if you're finding this discussion to be such a time waster.  My
goal is to improve ASSP for all, not to waste your time; you must know
that.

In the interest of conserving your time - summary question:
*Wouldn't it be better for ASSP to remove duplicate file names in excess of
X from notspam than for it to not remove them and instead remove other
files with more varied notspam content?*

Expanded:

It's helpful for me to now understand that the hmm/bayes analysis doesn't
weigh repetition more heavily than just one file in the opposite folder.
Thank you for that explanation.  When the users do mail merges, a lot of
the time, the body is subtly different (different dear line or other per
person customization for example), but based on what you're saying, I'd
think that they'd be substantially similar enough to act the same way as
you describe.  So good.
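
As a rough illustration of "substantially similar", here is a minimal sketch that measures how many distinct words two mail-merge bodies share. The template text and the tokens() and jaccard() helpers are made up for this example; this is not how ASSP tokenizes or scores mail, only a way to see how little a changed "Dear" line alters the word set.

```python
import re

def tokens(body):
    """Return the set of lowercased words in a message body."""
    return set(re.findall(r"[a-z0-9']+", body.lower()))

def jaccard(a, b):
    """Share of distinct words two bodies have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0

# Hypothetical mail-merge template; only the name differs per recipient.
template = "Dear {name}, thank you for your generous donation. Your receipt is attached."
alice = template.format(name="Alice")
bob = template.format(name="Bob")

similarity = jaccard(alice, bob)
print(f"shared distinct words: {similarity:.2f}")
```

With a longer, more realistic body the single differing name would matter even less.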

But what is the downside of having ASSP remove file names that occur more
than X times in notspam?  I understand that having more copies wouldn't
increase scoring for the content of >those< messages, but wouldn't keeping
them also push out, say, 5000 of the OTHER files that we want during a
rebuild, based on their age, and therefore give us a file store that's not
as diverse as it could be?  Isn't that a bad thing, or at least not as good
as it would be if the emails with duplicate file names were removed?

LocalFrequencyInt isn't going to help, at least in my case.  If the director
wants to ignore my instruction and policy, she needs to be able to.  Yes,
this is a policy problem, but the people high up in the charity will always
argue that if they need to send a message, they're going to.  They don't
pay me much, but it's a job that I need - I can't risk that by turning on
LocalFrequencyInt.  I don't see how noCollecting / noCollectRe is going to
help either.  I have no way of knowing who is going to send next or what
they're going to send.

I guess I just don't see the downside (other than your time in coding and
testing, along with a slightly longer cleanup process) of having ASSP remove
those duplicate file names at cleanup time, before removing oldest first.
Wouldn't that be better than having a notspam folder that, once cleanup
runs, could have only a handful of files with significantly different
content (say, if a couple of users sent a boatload of mail merges in one day)?
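
The cleanup order proposed above could be sketched roughly as follows. This is only an illustration, not ASSP code: plan_cleanup() is a hypothetical helper, the file-name pattern is assumed from the Your_Donation_Receipt--123456.txt example in this thread, and the entries and timestamps are made up.

```python
import re
from collections import defaultdict

# Assumed name layout, taken from the example in this thread:
# Your_Donation_Receipt--123456.txt
NAME_RE = re.compile(r"^(?P<subject>.*)--\d+\.txt$")

def plan_cleanup(entries, max_dups, max_files):
    """Return the sorted file names to delete.

    entries: list of (filename, mtime) pairs, mtime as a number.
    Pass 1 drops the oldest files of any subject beyond max_dups.
    Pass 2 drops the oldest remaining files beyond max_files.
    """
    by_subject = defaultdict(list)
    for name, mtime in entries:
        match = NAME_RE.match(name)
        # Names that do not fit the pattern are treated as unique subjects.
        subject = match.group("subject") if match else name
        by_subject[subject].append((mtime, name))

    delete, survivors = [], []
    for group in by_subject.values():
        group.sort(reverse=True)                      # newest first
        survivors.extend(group[:max_dups])
        delete.extend(name for _, name in group[max_dups:])

    survivors.sort(reverse=True)                      # newest first overall
    delete.extend(name for _, name in survivors[max_files:])
    return sorted(delete)

# Made-up folder contents: four copies of one subject and two other mails.
entries = [
    ("Your_Donation_Receipt--000001.txt", 100),
    ("Your_Donation_Receipt--000002.txt", 101),
    ("Your_Donation_Receipt--000003.txt", 102),
    ("Your_Donation_Receipt--000004.txt", 103),
    ("Board_Minutes--000005.txt", 50),
    ("Volunteer_Schedule--000006.txt", 60),
]
to_delete = plan_cleanup(entries, max_dups=2, max_files=3)
print(to_delete)
```

In this toy run the two oldest receipt copies go first, and only then does the oldest remaining file get trimmed to meet the total limit.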






On Mon, Mar 21, 2016 at 12:49 PM, Thomas Eckardt <thomas.ecka...@thockar.com> wrote:

> bonehead user sends 5000 -> LocalFrequencyInt and next configs
>
> regular user sends 5000 -> noCollecting , noCollectRe ...........
>
> This is not a coding task - this is an organizing and configuration task.
> As I always say - RTFM!
>
> >then delete as you already do files in
> >excess of the maximum total number of files?
>
> Oldest first - no content check.
>
> >that our notspam corpus remains diverse
>
> Having the 100% identical mail body 5000 times in one folder is the same
> for HMM and Bayes as having that mail in the folder once.
> Having the same mail once in the opposite folder eliminates all 5000
> for HMM and Bayes.
> BTW: this is independent of the file name or subject.
>
> This is not new (it has been this way for more than 10 years) - it is one
> of the basic concepts of HMM and Bayes.
>
> >I know that we must be missing something significant.
>
> Yes - the concept!
>
> You waste my time Ken.
>
> Thomas
>
>
>
> From:    K Post <nntp.p...@gmail.com>
> To:      ASSP development mailing list <assp-test@lists.sourceforge.net>
> Date:    21.03.2016 16:41
> Subject: Re: [Assp-test] Max Number Duplicate File Names
>
>
>
> -From Thomas, posted elsewhere
> >Remains the (my) question - what should be done with mails that
> >reaches the 'MaxAllowedHamDups' without breaking any concept and without
> >creating a new folder (which breaks several concepts)?
>
> The scenario where a bonehead user sends 5000 of the same message in an
> Outlook mailmerge isn't just a conceptual possibility, it happens.  And
> it's happening more and more frequently despite training, memos,
> reminders, and a very good email blast system in place that eliminated the
> need for mailmerges.
>
> What about when doing the nightly cleanup if you were to delete files with
> the same name in excess of max dups, then delete as you already do files
> in excess of the maximum total number of files?  I thought that was what
> was already happening with the spam corpus, but apparently not.
>
> I only see upside to limiting the number of dups in notspam, but you've
> stated elsewhere that the arguments herein don't make sense to you.  If
> you're saying what we suggest doesn't make any sense, I know that we must
> be missing something significant.  I know that Bayesian filtering works
> really well, but I only understand the inner workings from 35,000 feet. I
> just can't understand how making every effort to ensure that our notspam
> corpus remains diverse doesn't make sense.
>
> Thanks again.  Hope we can continue this discussion.
>
> On Mon, Mar 14, 2016 at 5:28 PM, K Post <nntp.p...@gmail.com> wrote:
>
> > One of our staff inadvertently sent about 3400 of the same test message
> > out through our server.  Okay, okay, it was me - I had a loop coded wrong,
> > and before I noticed what was going on and could stop it, about 3400 of
> > the same message went out. Fortunately, they were just to me.  Sure
> > enough, all 3400 were in notspam.
> >
> > So, could we keep discussing this, and does it make sense to?
> >
> > On Thu, Mar 10, 2016 at 1:47 PM, K Post <nntp.p...@gmail.com> wrote:
> >
> >> Isn't that exact same logic an argument for having the maximum number of
> >> duplicate subjects apply to the HAM / notspam folder too?  5000 or 15000
> >> of the same message sent individually by (untrainable / apathetic) users
> >> would fill the notspam folder and mess up HMM / Bayesian, right?
> >>
> >> And for those RE / FWD / No subject emails, maybe we could have ASSP
> >> ignore subjects shorter than say 5 or 6 characters when deleting
> >> duplicate file names?  Then those files could get wiped out oldest first
> >> during the maintenance.
> >>
> >>
> >> On Thu, Mar 10, 2016 at 11:18 AM, Thomas Eckardt <thomas.ecka...@thockar.com> wrote:
> >>
> >>> Just think about the logic behind Bayesian and HMM - this will answer
> >>> your question.
> >>>
> >>> Having the same mail in the spam folder multiple times will score the
> >>> content as extremely spam-heavy, even if your users are using the same
> >>> content - but less often.
> >>>
> >>> Thomas
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> From:    K Post <nntp.p...@gmail.com>
> >>> To:      ASSP development mailing list <assp-test@lists.sourceforge.net>
> >>> Date:    10.03.2016 16:58
> >>> Subject: Re: [Assp-test] Max Number Duplicate File Names
> >>>
> >>>
> >>>
> >>> I know you're all RTFM, but there are plenty of places in the GUI where
> >>> the description isn't exactly clear or right.  For example:
> >>>
> >>> MaxFiles
> >>> If you're not using subjects as file names ( UseSubjectsAsMaillogNames ),
> >>> this is the maximum number of files to keep in each collection (spam &
> >>> nonspam)
> >>> It's actually less than this -- files get a random number between 1 and
> >>> MaxFiles.
> >>>
> >>> I AM using subjects as file names, and MaxFiles DOES control the maximum
> >>> number of files in each collection when MaintBayesCollection is on and
> >>> no max age is set, despite what the description says. The language is
> >>> not clear, and that makes us assume things, sometimes incorrectly, about
> >>> what the GUI really means.  We've been working this way since ASSP came
> >>> out. Because of this, I had no way of knowing that MaxAllowedDups
> >>> >really< only applied to the spam collection.  I assumed the GUI meant
> >>> the whole log of spam and NOTspam.  I don't think that's an unreasonable
> >>> assumption, or call it an oversight, or a mistake on my part - but none
> >>> of that justifies an angry-sounding response from you.
> >>>
> >>> I'm not looking for a fight, but I feel like I have to keep justifying
> >>> myself after you appear to be so angry with me, and the rest of us, who
> >>> turn to you for enlightenment.  You're carrying the entire weight of
> >>> this project on your shoulders.  It's a lot, I know.  Can we move on and
> >>> have a reasonable discussion here?
> >>>
> >>> Is there a reason that MaxAllowedDups shouldn't also apply to the
> >>> notspam collection?   Shouldn't we want that to be the case for the same
> >>> reason that we have it for spam?   Maybe also to the errors collections?
> >>>
> >>> If we don't, wouldn't a staff member sending the same basic message to
> >>> 5000 people (against my wishes, but I can't control everything) take 1/3
> >>> of the other notspam messages out of the rebuild process?  How about if
> >>> 20k messages are sent?
> >>>
> >>> Maybe I'm just not understanding, and that's why I'm asking, but I hope
> >>> it doesn't result in any more scolding.
> >>>
> >>> Thank you
> >>>
> >>>
> >>> On Thu, Mar 10, 2016 at 4:15 AM, Thomas Eckardt <thomas.ecka...@thockar.com> wrote:
> >>>
> >>> > >There are about 600 of those files in NotSpam.
> >>> >
> >>> > 'MaxAllowedDups','Max Number of Duplicate File Names'
> >>> >   'The maximum number of logged files with the same filename (subject)
> >>> >   that are stored in the spam folder (spamlog),........
> >>> >
> >>> > I'll write in Hebrew - possibly the English is better if you translate
> >>> > it back to English.
> >>> >
> >>> > Thomas
> >>> >
> >>> >
> >>> >
> >>> > From:    K Post <nntp.p...@gmail.com>
> >>> > To:      ASSP development mailing list <assp-test@lists.sourceforge.net>
> >>> > Date:    10.03.2016 00:29
> >>> > Subject: [Assp-test] Max Number Duplicate File Names
> >>> >
> >>> >
> >>> >
> >>> > I've got UseSubjectsAsMaillogNames checked (the messages are stored in
> >>> > the folders using the subject name followed by a 6-digit number, as
> >>> > expected)
> >>> >
> >>> > I've got MaxAllowedDups set to 3
> >>> >
> >>> > MaxBayesFileAge is 0
> >>> > MaxFiles is 15000
> >>> >
> >>> > I'm noticing that MaxAllowedDups doesn't seem to be working.
> >>> >
> >>> > For example, a couple users often send emails with the subject
> >>> > "Your Donation Receipt"
> >>> > There are about 600 of those files in NotSpam.
> >>> > Your_Donation_Receipt--123456.txt
> >>> > where 123456 is a random differing number.
> >>> >
> >>> > Shouldn't only 3 of these files exist in the folder (with the
> >>> > exception of those that were sent since the rebuild / maintenance
> >>> > window)?
> >>> >
> >>> > Thanks
> >>> >
> >>> >
> >>>
> >>>
>
> ------------------------------------------------------------------------------
> >>> > Transform Data into Opportunity.
> >>> > Accelerate data analysis in your applications with
> >>> > Intel Data Analytics Acceleration Library.
> >>> > Click to learn more.
> >>> > http://pubads.g.doubleclick.net/gampad/clk?id=278785111&iu=/4140
> >>> > _______________________________________________
> >>> > Assp-test mailing list
> >>> > Assp-test@lists.sourceforge.net
> >>> > https://lists.sourceforge.net/lists/listinfo/assp-test
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > DISCLAIMER:
> >>> > *******************************************************
> >>> > This email and any files transmitted with it may be confidential,
> >>> > legally privileged and protected in law and are intended solely for the
> >>> > use of the individual to whom it is addressed.
> >>> > This email was multiple times scanned for viruses. There should be no
> >>> > known virus in this email!
> >>> > *******************************************************
_______________________________________________
Assp-test mailing list
Assp-test@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/assp-test
