The classifier I'm working with is entirely supervised -- the
documents in the corpus are assigned categories based on structured
document data, and we extract features from the text for training. The
whitelist identifies which n-grams should be used as features.
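For concreteness, the whitelist step amounts to something like the
following (a minimal sketch, not our actual code -- the function names
are made up and tokenization here is naive whitespace splitting):

```python
def ngrams(tokens, n):
    """Yield contiguous n-grams from a token list as space-joined strings."""
    return (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def whitelist_features(text, whitelist, max_n=2):
    """Return only those unigrams/bigrams in the text that appear in the
    precomputed whitelist; everything else is dropped as a feature."""
    tokens = text.lower().split()
    features = []
    for n in range(1, max_n + 1):
        features.extend(g for g in ngrams(tokens, n) if g in whitelist)
    return features
```

So the whitelist acts purely as a feature filter; the category labels
themselves come from the structured data, not from the text.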

I suspect something similar to what you described could be done here:
comparing the representation of n-grams inside a class versus outside
it. I need to dig deeper into the classifier mechanics, though, to see
whether that would lead to some sort of overfitting. Thanks for the
suggestion.
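As a rough illustration of that in-class vs. out-of-class comparison
(a sketch only -- this is the standard 2x2 log-likelihood ratio in its
entropy formulation, not anything taken from the Mahout code):

```python
import math

def llr_2x2(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.

    k11: occurrences of the n-gram inside the class
    k12: occurrences of all other n-grams inside the class
    k21: occurrences of the n-gram outside the class
    k22: occurrences of all other n-grams outside the class
    """
    def h(*counts):
        # sum of k * log(k / N); zero counts contribute nothing
        total = sum(counts)
        return sum(k * math.log(k / total) for k in counts if k > 0)
    return 2.0 * (h(k11, k12, k21, k22)
                  - h(k11 + k12, k21 + k22)
                  - h(k11 + k21, k12 + k22))
```

An n-gram whose in-class rate far exceeds its out-of-class rate gets a
high score and would be a whitelist candidate; whether selecting
features that way overfits is exactly the open question above.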

Drew

On Tue, Feb 16, 2010 at 1:52 PM, Jake Mannix <[email protected]> wrote:
> So since you're building both a classifier and a search index, I'm guessing
> to train your classifier you have at least some example docs to train on,
> right?   If you have an n-way classifier in which one of the classes is
> "other/unclassified", then you could look for n-grams which are
> overrepresented in the union of the classes which aren't "other" (i.e.,
> these n-grams are representative of some useful class).  These n-grams
> could form
> your whitelist.
>
>  -jake
>
> On Feb 16, 2010 10:23 AM, "Drew Farris" <[email protected]> wrote:
>
> Hi Jake,
>
> Yes, I'm using the LLR score. I was wondering if there is anything
> else I should be looking at other than LLR and min/max DF. The corpus
> is large and the list is too big to review by hand, so I'm wondering
> if there's any sort of additional measure I can use to suggest
> whether I should consider stopping additional subgrams or something
> of that nature.
>
> Ideally, this would be something that could be rolled back into the
> existing collocation identifier in Mahout.
>
> Thanks,
>
> Drew
>
> (Thanks also to Ken and Jason for the comments and pointers -- DF is
> highly effective indeed.)
>
> On Tue, Feb 16, 2010 at 1:03 PM, Jake Mannix <[email protected]> wrote:
>> Drew,
>>
>>  Did you p...
>