The classifier I'm working with is entirely supervised -- the documents in the corpus are assigned categories based on structured document data, and we extract features from the text to do the training. The whitelist identifies which n-grams should be used as features.
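For concreteness, the extraction step amounts to gating the n-grams the analyzer produces against that whitelist -- something along these lines (the class and method names here are hypothetical, not our actual code):

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Set;

  // Hypothetical sketch: keep only whitelisted n-grams as features.
  public final class WhitelistFeatureExtractor {
    private final Set<String> whitelist;

    public WhitelistFeatureExtractor(Set<String> whitelist) {
      this.whitelist = whitelist;
    }

    // 'ngrams' is the n-gram stream for one document,
    // e.g. the output of a shingle filter over its tokens.
    public List<String> extract(List<String> ngrams) {
      List<String> features = new ArrayList<String>();
      for (String ngram : ngrams) {
        if (whitelist.contains(ngram)) {
          features.add(ngram);
        }
      }
      return features;
    }
  }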
I suspect something along the lines of what you described can be done here, looking at the representation of n-grams within a class vs. outside that class, but I need to dig deeper into the classifier mechanics to see whether that would lead to some sort of overfitting. Thanks for the suggestion. (A rough sketch of what I have in mind is below the quoted thread.)

Drew

On Tue, Feb 16, 2010 at 1:52 PM, Jake Mannix <[email protected]> wrote:
> So since you're building both a classifier and a search index, I'm guessing
> to train your classifier you have at least some example docs to train on,
> right? If you have an n-way classifier in which one of the classes is
> "other/unclassified", then you could look for ngrams which are
> overrepresented in the union of the classes which aren't "other" (i.e. these
> ngrams are representative of some useful class). These ngrams could form
> your whitelist.
>
> -jake
>
> On Feb 16, 2010 10:23 AM, "Drew Farris" <[email protected]> wrote:
>
> Hi Jake,
>
> Yes, I'm using the LLR score. I was wondering if there is anything
> else I should be looking at other than LLR and min/max DF. The corpus
> is large and the list is too big to review by hand, so I'm wondering if
> there's any sort of additional measure I can use to suggest whether I
> should consider stopping additional subgrams or something of that
> nature.
>
> Ideally, this would be something that could be rolled back into the
> existing collocation identifier in Mahout.
>
> Thanks,
>
> Drew
>
> (Thanks also Ken, Jason for the comments and pointers -- DF is highly
> effective indeed.)
>
> On Tue, Feb 16, 2010 at 1:03 PM, Jake Mannix <[email protected]> wrote:
>> Drew,
>>
>> Did you p...
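Here's the sketch I mentioned: score each candidate n-gram with Dunning's log-likelihood ratio over a 2x2 contingency table of its occurrences in the non-"other" classes vs. the "other" bucket, then whitelist the top scorers. The LLR computation below follows Dunning's formula (the same statistic Mahout's collocation code uses); the counts in main() are made up purely for illustration.

  // Sketch: LLR overrepresentation score for one n-gram.
  // k11: n-gram occurrences in docs from the useful classes
  // k12: n-gram occurrences in "other" docs
  // k21: all other n-gram occurrences in the useful classes
  // k22: all other n-gram occurrences in "other" docs
  public final class NGramLlr {

    private static double xLogX(long x) {
      return x == 0L ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy over raw counts: xLogX(sum) - sum(xLogX(x)).
    private static double entropy(long... counts) {
      long sum = 0L;
      double xlx = 0.0;
      for (long c : counts) {
        sum += c;
        xlx += xLogX(c);
      }
      return xLogX(sum) - xlx;
    }

    public static double logLikelihoodRatio(long k11, long k12,
                                            long k21, long k22) {
      double rowEntropy = entropy(k11 + k12, k21 + k22);
      double columnEntropy = entropy(k11 + k21, k12 + k22);
      double matrixEntropy = entropy(k11, k12, k21, k22);
      if (rowEntropy + columnEntropy < matrixEntropy) {
        return 0.0; // guard against round-off yielding a negative score
      }
      return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
    }

    public static void main(String[] args) {
      // Made-up counts: the n-gram appears 120 times in classified docs and
      // 25 times in "other"; the buckets hold 500k and 400k n-gram tokens.
      double score = logLikelihoodRatio(120, 25, 500000 - 120, 400000 - 25);
      System.out.println("LLR = " + score);
    }
  }

One caveat: LLR flags any difference in distribution, so you'd also want to check that k11/(k11+k21) > k12/(k12+k22) to keep only the overrepresented n-grams rather than the underrepresented ones.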
