Andrew Jaquith wrote:
** Warning: long post **
After some fooling around and some actual work, I've finished my first
pass at refactoring on the anti-spam code. I'm proposing a new
package, org.apache.wiki.content.inspect, which contains a
general-purpose content-inspection capability, of which spam is just
one potential application. Here is a draft of the package javadocs.
[...]
I can foresee other uses for this too, for example general-purpose
content classification. But that's for another day.
Comments, thoughts? It's going to take some time to get unit tests
done, so I won't be committing this for a little while.
Hi Andrew,
This sounds pretty impressive, all in all. With my library hat on, my
interest was piqued by the idea of using this for non-spam applications,
so the only comment I have at this point is to wonder how you might
include a hook into Lucene.
The way I'd see this working would be as follows.
I'd not want to overload the Dublin Core Subject; rather, this would be
a sort of informative field that might actually be used to populate the
Subject. The result of an inspection would be structured as a map of
pseudo-subject (facet?) identifiers with a score for each, e.g.,
Subject: Shipping, Shipwrecks, Transportation
Pseudo-Subject: Lusitania Score: 0.67
Pseudo-Subject: http://en.wikipedia.org/wiki/Titanic Score: 0.89
Pseudo-Subject: Storm Score: 0.56
Pseudo-Subject: Mermaid Score: 0.24
Where the "pseudo-subject" can be either a string or a URI subject
identifier. Note that "pseudo-subject" is not a term of art, and I'd
hope to come up with something more suitable. One could then use some
mathematically sensible composite of the scores to obtain the overall
score for the document. You could even choose subsets of the
pseudo-subjects to obtain targeted scores. This would still work for
spam detection but would potentially be very powerful for subject
classification, especially if it were tied into the search functionality.
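To make the idea above concrete, here is a minimal sketch of such an
inspection result: a map from pseudo-subject identifiers (plain strings
or URIs) to scores, with a composite score over all facets or over a
chosen subset. The class and method names are purely illustrative, not
part of the proposed org.apache.wiki.content.inspect API, and the
arithmetic mean is just one possible "mathematically sensible" composite.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch only -- names are illustrative, not the actual API.
public class InspectionResult {
    // Pseudo-subject identifier (string or URI) -> score in [0, 1].
    private final Map<String, Double> scores = new LinkedHashMap<>();

    public void addPseudoSubject(String identifier, double score) {
        scores.put(identifier, score);
    }

    /** Composite score over all pseudo-subjects (arithmetic mean here;
     *  any other sensible combination could be substituted). */
    public double compositeScore() {
        return scores.values().stream()
                     .mapToDouble(Double::doubleValue)
                     .average().orElse(0.0);
    }

    /** Targeted composite over a chosen subset of pseudo-subjects. */
    public double compositeScore(Set<String> subset) {
        return scores.entrySet().stream()
                     .filter(e -> subset.contains(e.getKey()))
                     .mapToDouble(Map.Entry::getValue)
                     .average().orElse(0.0);
    }

    public static void main(String[] args) {
        InspectionResult r = new InspectionResult();
        r.addPseudoSubject("Lusitania", 0.67);
        r.addPseudoSubject("http://en.wikipedia.org/wiki/Titanic", 0.89);
        r.addPseudoSubject("Storm", 0.56);
        r.addPseudoSubject("Mermaid", 0.24);
        System.out.printf("overall=%.3f%n", r.compositeScore());
        System.out.printf("targeted=%.3f%n",
                r.compositeScore(Set.of("Lusitania", "Storm")));
    }
}
```

A spam inspector could threshold the overall composite, while a subject
classifier might feed the targeted scores (or the top-scoring facets)
into the Subject field or the search index.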
Does this make any sense?
Murray