Re: Spam package redesign

Andrew Jaquith Mon, 28 Sep 2009 11:25:10 -0700

There's some room for optimization in what you proposed, but your
p-code is close to what I was thinking. in essence, it is the Strategy
pattern that calls the inspector logic, combined with a listener for
scoring changes (aka Visitor pattern) that interested callers would
subscribe to. That works.


After having originally suggested an "InspectionInterruptedException"
so that inspectors could abort a running inspection chain, I now think
it's better to simply return an object from the listener/visited
method, rather than throw an exception. Exceptions could be perceived
by an implementer as a little trickier to debug. I'd rather make the
API simpler.

FYI, the workflow package isn't totally suitable here. When you cited
the Decision class, I suspect you actually meant the Outcome class,
no?

Ok. I've got what I need to do this properly. This should work pretty well.

Andrew

On Mon, Sep 28, 2009 at 3:11 AM, Janne Jalkanen
<[email protected]> wrote:
>
> Yah, it's just that smaller numbers are usually easier to deal with (numbers
> from 1-20).  We need more resolution at the low end, not at the high end.
>  That's my preference for using floats.
>
> A simple way to do as you suggest would be to have a Score return value,
> e.g.
>
> public Score inspect(...)
> {
>    // Detection score goes here
>
>    if( isSpam )
>        return new Score( "mywonderfulspamdetectortest", 0.5f );
>
>    return Score.OK; // where score.OK is defined to be 0.0f.
> }
>
> where the first String is a key and the float is a default value, which can
> be overridden in a config file using the key. This would be pretty close to
> the SpamAssassin model (and close to the java.util.Properties conceptually,
> so it would be easy to adapt).
>
> (The reason why I'm proposing a default value is so that when you add a new
> Inspector, you don't have to modify two files.)
>
> After each Inspector would be called, then the parent would call the
> callback I proposed, and that would then determine what to do.  In
> pseudocode:
>
> // JSPWiki internal code.
> private void inspect( DecisionMaker decisionMaker, Change change, <some
> other values> )
> {
>    ArrayList scores;
>    for( Inspector inspector : m_inspectorChain )
>    {
>        Score s = inspector.inspect( change, <some values> );
>
>        scores.add( s );
>
>        if( decisionMaker.visit( scores ) == Decision.SPAM ) { ... }
>    }
> }
>
> We can then provide something like this
>
> class SpamFilter implements PageFilter, DecisionMaker
> {
>    ...
>
>    // This emulates the 2.8.x SpamFilter functionality.
>    public Decision visit( List<Score> scores )
>    {
>        if( scores.last().getValue() > 0.0 ) return Decision.SPAM;
>
>        return Decision.CONTINUE;
>    }
> }
>
> (At any rate, it might make sense to reuse some classes from the workflow
> package here; after all, this is a machine workflow in a sense, is it not?)
>
> (Another possible pattern would be to have the DecisionMaker throw an
> exception. Don't quite know what's better.)
>
> /Janne
>
> On Sep 28, 2009, at 02:04 , Andrew Jaquith wrote:
>
>> It's only more complicated in the sense that I'd have to change some
>> recently-written code. :)
>>
>> Both integer and floating point -- as you point out -- are really the
>> same except for the choice of where to put the decimal point. I chose
>> integer to preserve the same scoring strategy from the previous
>> design. Basically: add stuff up, and if the total exceeds a theshold,
>> you've got spam.
>>
>> I took a quick look at SpamAssassin. They do use floating point
>> scores. I like the way they allow weights to be customized for each
>> rule. That seems like it would be a good enhancement for us, too.
>>
>> So, it seems to me that if we go floating-point, we should make it
>> possible to configure custom weights for each Inspector. That also
>> implies that it would be better to simply have Inspector.inspect()
>> simply return a "pass/fail/no result" result and have the parent
>> Inspection object figure out what to add or subtract from the running
>> total. This would mean that the Inspectors wouldn't be able to modify
>> scores directly. And it would be closer to the SpamAssessin model, and
>> would probably simpler to code for too.
>>
>> Reasonable?
>>
>> On Sun, Sep 27, 2009 at 4:12 PM, Janne Jalkanen
>> <[email protected]> wrote:
>>>
>>> Heya!
>>>
>>> I don't quite understand - why would a floating point value be any more
>>> complicated than an integer?
>>>
>>> Note that at the moment we mark a modification a spam if a single test
>>> fails, which essentially means doing logical OR more than any sort of an
>>> addition (=threshold is always 1).  The integer is there more or less as
>>>
>>> something-which-was-never-completed-and-I-am-almost-sure-is-the-wrong-design.
>>> In fact, I did lament the fact that you can't say that "X is a bit more
>>> efficient than Y" - with an integer-based design you essentially go for
>>> fixed point arithmetics (test A gives you 100, test B gives you 150, etc,
>>> where you just shifted decimal points to the right just for the heck of
>>> it).
>>>
>>> If you take a look at SpamAssassin, it uses floats.
>>>
>>> I'd hesitate changing it later, since it will create an API contract.
>>>
>>> /Janne
>>>
>>> On Sep 27, 2009, at 22:34 , Andrew Jaquith wrote:
>>>
>>>> Janne --- good critique, and good suggestions. I've killed the
>>>> limit-setting methods on the Inspection class itself in favor of a
>>>> listener
>>>> strategy. I adopteed your other suggestions too, except...
>>>>
>>>> ...at the moment I am inclined to leave the scoring scale as
>>>> integer-based
>>>> for now. I like it because it is simple, and resembles what we do now.
>>>> We
>>>> can change it later if need be.
>>>>
>>>> In answer to your first question: the inspector package does not help
>>>> graduation, although it does help the beta that comes after the diploma.
>>>> :)
>>>>
>>>> Andrew
>>>>
>>>> On Sep 27, 2009, at 4:37, Janne Jalkanen <[email protected]>
>>>> wrote:
>>>>
>>>>>
>>>>> My main concern: does this advance our graduation?
>>>>>
>>>>> Others inline.
>>>>>
>>>>> On 25 Sep 2009, at 20:54, Andrew Jaquith wrote:
>>>>>
>>>>>> The "inspect" package provides facilities that allow content to be
>>>>>> inspected and scored based on various criteria such as whether a wiki
>>>>>> page change contains spam. Content may be scored by initializing an
>>>>>> Inspection object and then calling the {...@link Inspection#inspect()}
>>>>>> method. The {...@code inspect} method, in turn, iterates through a
>>>>>> variety of previously-configured {...@link Inspector} objects and calls
>>>>>> the {...@link Inspector#inspect(String,String)} method for each one. Each
>>>>>> of the configured inspectors is free to perform whatever evaluations
>>>>>> it wishes, and can increase, decrease or reset the "scores" for any
>>>>>> score category, for example the {...@link Score#SPAM} category.
>>>>>
>>>>> Ok.
>>>>>
>>>>>> * Score objects supply instructions to the parent Inspection object to
>>>>>> increment, decrement or reset the score for a particular category.
>>>>>> Each Score object is constructed with a category (for example,
>>>>>> Score.SPAM), an integer indicating how much to change the score, and
>>>>>> an optional String message that provides context for the change. For
>>>>>> example, a Score that increments the spam score by 1 could be
>>>>>> constructed by new Score( Score.SPAM, 1, "Bot detected." ). Negative
>>>>>> numbers can be supplied also to decrease the score. For convenience,
>>>>>> {...@link Score#INCREMENT_SCORE} means "add 1", {...@link
>>>>>> Score#DECREMENT_SCORE} means "subtract 1", and {...@link Score#RESET}
>>>>>> means "reset to zero."
>>>>>
>>>>> Smells of overdesign to me.  Creating a new operation and methodology
>>>>> to
>>>>> support addition sounds like a bad idea. Not to mention the fact that
>>>>> you
>>>>> might actually want to use fractions.
>>>>>
>>>>> Something like inspection.changeScore( Score.SPAM, 0.5f ) sounds more
>>>>> appealing to me.
>>>>>
>>>>>> public String preSave( WikiContext context, String content ) throws
>>>>>> RedirectException
>>>>>> {
>>>>>> Change change = getChange( context, content );
>>>>>>
>>>>>> // Run the Inspection
>>>>>> Inspection inspection = new Inspection( context, m_config,
>>>>>> m_inspectors
>>>>>> );
>>>>>> inspection.addLimit( Score.SPAM, m_scoreLimit );
>>>>>> inspection.inspect( content, change );
>>>>>> int spamScore = inspection.getScore( Score.SPAM );
>>>>>> context.setVariable( ATTR_SPAMFILTER_SCORE, spamScore );
>>>>>>
>>>>>> // Redirect user if score too high
>>>>>> if( inspection.isFailed() )
>>>>>> {
>>>>>>  ...
>>>>>> }
>>>>>> ...
>>>>>> }
>>>>>
>>>>> Yes, this looks unnecessarily complicated and limiting to me.  I would
>>>>> remove automatic limit-rating altogether, and would just concentrate on
>>>>> refactoring the individual methods from SpamFilter into the
>>>>> inspect-package
>>>>> with a light wrapper around them.  I think a callback design is better
>>>>> than
>>>>> adding quite limited limit thingies anyway, as this gives full
>>>>> programmatic
>>>>> control to the developer without us having to pre-guess the possible
>>>>> limitations.
>>>>>
>>>>> E.g. inspection.inspect( content, change, new
>>>>> InspectionResultListener()
>>>>> {
>>>>>  public boolean visit(Inspection ins)
>>>>>  {
>>>>>      // Any possible decision code goes here
>>>>>      return ins.getScore() > m_maxScore;
>>>>>  }
>>>>> });
>>>>>
>>>>> Obviously, these listeners would be optional, and you could just let
>>>>> the
>>>>> entire chain run without interference. And for simplicity, we can
>>>>> provide
>>>>> some default listeners, e.g.
>>>>>
>>>>> inspection.inspect( content, change, new StopAtScoreListener(
>>>>> Score.SPAM,
>>>>> 5 ) );
>>>>>
>>>>> (Yeah, I'm a big fan of this pattern 'cos it gives much more control to
>>>>> the developer and eases debugging immensely.)
>>>>>
>>>>>> ChangeRateInspector, LinkCountInspector, SpamInspector (for Akismet),
>>>>>
>>>>> This should obviously be called AkismetInspector; all the others detect
>>>>> spammers too (and Akismet is probably the worst of the lot anyway).
>>>>>
>>>>> /Janne
>>>
>>>
>
>

Re: Spam package redesign

Reply via email to