Re: Spam package redesign

Andrew Jaquith Fri, 25 Sep 2009 16:30:57 -0700

You could certainly do that with the package as I've described it --
the pseudo-subject or facet is what I called a Score "category," if I
catch your meaning.


I'll take this as a vote of confidence -- when I check it in (in the
next few weeks probably), you'll be able to see the code for yourself.
:)

Andrew

On Sat, Sep 26, 2009 at 12:17 AM, Murray Altheim <[email protected]> wrote:
> Andrew Jaquith wrote:
>>
>> ** Warning: long post **
>>
>> After some fooling around and some actual work, I've finished my first
>> pass at refactoring on the anti-spam code. I'm proposing a new
>> package, org.apache.wiki.content.inspect, which contains a
>> general-purpose content-inspection capability, of which spam is just
>> one potential application. Here is a draft of the package javadocs.
>
> [...]
>>
>> I can foresee other uses for this too, for example general-purpose
>> content classification. But that's for another day.
>>
>> Comments, thoughts? It's going to take some time to get unit tests
>> done, so I won't be committing this for a little while.
>
> Hi Andrew,
>
> This sounds pretty impressive, all in all. With my library hat on, my
> interest was piqued by the idea of using this for non-spam applications,
> so my only comment at this point is to wonder how you might include
> the hook into Lucene.
>
> The way I'd see this working would be as follows.
>
> I'd not want to overload the Dublin Core Subject, but would instead treat
> the result as a sort of informative field that might actually be used to
> populate the Subject. The structure of the result of the inspection would
> be a map of pseudo-subject (facet?) identifiers and a score for each, e.g.,
>
>  Subject:         Shipping, Shipwrecks, Transportation
>  Pseudo-Subject:  Lusitania                             Score: 0.67
>  Pseudo-Subject:  http://en.wikipedia.org/wiki/Titanic  Score: 0.89
>  Pseudo-Subject:  Storm                                 Score: 0.56
>  Pseudo-Subject:  Mermaid                               Score: 0.24
>
> Where the "pseudo-subject" can be either a string or a URI subject
> identifier. Note that "pseudo-subject" is not a term of art, and
> I'd hope to come up with something more suitable. One could then use
> some mathematically-sensible composite of the scores to obtain the
> overall score for the document. You could even choose subsets of the
> pseudo-subjects to obtain targeted scores. This would still work for
> spam detection but would potentially be very powerful for subject
> classification, especially if it was tied into the search functionality.
>
> Does this make any sense?
>
> Murray
>
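
The map of pseudo-subjects and scores described in the quoted message could be sketched roughly as below. This is only an illustration, not the code in org.apache.wiki.content.inspect: the class name PseudoSubjectScores, its methods, and the choice of an arithmetic mean as the "mathematically-sensible composite" are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical holder for inspection results: identifier (string or URI) -> score.
final class PseudoSubjectScores {
    private final Map<String, Double> scores = new LinkedHashMap<String, Double>();

    // Records the score for one pseudo-subject identifier.
    void put(String identifier, double score) {
        scores.put(identifier, score);
    }

    // Overall score for the document: here, the mean of all scores (an assumption).
    double composite() {
        return composite(scores.keySet());
    }

    // Targeted score over a chosen subset; identifiers with no score are skipped.
    double composite(Collection<String> identifiers) {
        double sum = 0.0;
        int count = 0;
        for (String id : identifiers) {
            Double score = scores.get(id);
            if (score != null) {
                sum += score;
                count++;
            }
        }
        return count == 0 ? 0.0 : sum / count;
    }

    public static void main(String[] args) {
        PseudoSubjectScores result = new PseudoSubjectScores();
        result.put("Lusitania", 0.67);
        result.put("http://en.wikipedia.org/wiki/Titanic", 0.89);
        result.put("Storm", 0.56);
        result.put("Mermaid", 0.24);

        System.out.println("Overall:  " + result.composite());
        System.out.println("Targeted: "
            + result.composite(Arrays.asList("Lusitania", "Storm")));
    }
}
```

A mean is the simplest composite to show; a weighted sum or a maximum would fit the same interface if a different notion of "overall score" were wanted.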
