You could certainly do that with the package as I've described it -- the pseudo-subject or facet is what I called a Score "category," if I catch your meaning.
I'll take this as a vote of confidence -- when I check it in (in the next few weeks probably), you'll be able to see the code for yourself. :)

Andrew

On Sat, Sep 26, 2009 at 12:17 AM, Murray Altheim <[email protected]> wrote:
> Andrew Jaquith wrote:
>>
>> ** Warning: long post **
>>
>> After some fooling around and some actual work, I've finished my first
>> pass at refactoring on the anti-spam code. I'm proposing a new
>> package, org.apache.wiki.content.inspect, which contains a
>> general-purpose content-inspection capability, of which spam is just
>> one potential application. Here is a draft of the package javadocs.
>
> [...]
>>
>> I can foresee other uses for this too, for example general-purpose
>> content classification. But that's for another day.
>>
>> Comments, thoughts? It's going to take some time to get unit tests
>> done, so I won't be committing this for a little while.
>
> Hi Andrew,
>
> This sounds pretty impressive, all in all. With my library hat on, my
> interest was piqued by the idea of using this for non-spam applications,
> so the only comment I have at this point is wondering how you might at
> this point include the hook into Lucene.
>
> The way I'd see this working would be as follows.
>
> I'd not want to overload the Dublin Core Subject, but as a sort of
> informative field that might actually be used to populate the Subject.
> The structure of the result of the inspection would be a map of
> pseudo-subject (facet?) identifiers and a score for each, e.g.,
>
>   Subject: Shipping, Shipwrecks, Transportation
>   Pseudo-Subject: Lusitania                             Score: 0.67
>   Pseudo-Subject: http://en.wikipedia.org/wiki/Titanic  Score: 0.89
>   Pseudo-Subject: Storm                                 Score: 0.56
>   Pseudo-Subject: Mermaid                               Score: 0.24
>
> Where the "pseudo-subject" can be either a string or a URI subject
> identifier. And noting that "pseudo-subject" is not a term of art and
> I'd hope to come up with something more suitable. One could then use
> some mathematically-sensible composite of the scores to obtain the
> overall score for the document. You could even choose subsets of the
> pseudo-subjects to obtain targeted scores. This would still work for
> spam detection but would potentially be very powerful for subject
> classification, especially if it was tied into the search functionality.
>
> Does this make any sense?
>
> Murray
>
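
To make the scoring idea in the quoted message a bit more concrete, here is a rough sketch of a map from pseudo-subject identifiers to scores, with an overall composite and a targeted composite over a chosen subset. Everything in it is an assumption made for illustration: the class name FacetScores and its methods are hypothetical and are not part of org.apache.wiki.content.inspect, and the arithmetic mean merely stands in for whatever "mathematically-sensible composite" one would actually choose.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; not the org.apache.wiki.content.inspect API. */
public class FacetScores {

    /** Arithmetic mean of all scores (an assumed composite); 0.0 for an empty map. */
    public static double composite(Map<String, Double> scores) {
        return scores.values().stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(0.0);
    }

    /** Mean over a chosen subset of identifiers; identifiers not in the map are ignored. */
    public static double composite(Map<String, Double> scores, List<String> ids) {
        Map<String, Double> subset = new LinkedHashMap<>();
        for (String id : ids) {
            Double score = scores.get(id);
            if (score != null) {
                subset.put(id, score);
            }
        }
        return composite(subset);
    }

    /** The example identifiers (plain strings or URIs) and scores from the message. */
    public static Map<String, Double> example() {
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("Lusitania", 0.67);
        scores.put("http://en.wikipedia.org/wiki/Titanic", 0.89);
        scores.put("Storm", 0.56);
        scores.put("Mermaid", 0.24);
        return scores;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = example();
        System.out.println("overall:  " + composite(scores));
        System.out.println("targeted: " + composite(scores,
                List.of("Lusitania", "http://en.wikipedia.org/wiki/Titanic")));
    }
}
```

With the four example scores, the overall mean comes out at roughly 0.59, and the targeted mean over the two ship-related identifiers at roughly 0.78. A weighted mean or a maximum would slot into composite() just as easily; the mean is used here only because it is the simplest thing to show.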
