I would opt for the most specific tokenization that is feasible (no stemming, as much compounding as possible). The rationale for this is that stemming and uncompounding can be added by linear transformations of the matrix at any time.
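To make that concrete, here is a minimal sketch (not Mahout code; the vocabulary, counts, and stem mapping are made up for illustration) of how stemming can be applied after the fact as a single matrix multiply: a binary mapping matrix S, with one row per stem and one column per surface term, collapses an unstemmed term-document matrix X into a stemmed one via S X. Uncompounding would work the same way, with a (possibly many-to-many) map that spreads a compound term's counts over its components.

```python
# Sketch: stemming as a linear transformation of a term-document matrix.
import numpy as np

# Hypothetical unstemmed vocabulary and a tiny term-document count matrix.
terms = ["run", "running", "runs", "compound", "compounds"]
X = np.array([
    [1, 0, 2],   # "run"
    [0, 3, 0],   # "running"
    [1, 1, 0],   # "runs"
    [0, 2, 1],   # "compound"
    [4, 0, 0],   # "compounds"
])  # shape: (num_terms, num_docs)

# S[i, j] = 1 if surface term j maps to stem i.
stems = ["run", "compound"]
stem_of = {"run": "run", "running": "run", "runs": "run",
           "compound": "compound", "compounds": "compound"}
S = np.zeros((len(stems), len(terms)))
for j, t in enumerate(terms):
    S[stems.index(stem_of[t]), j] = 1.0

# Stemmed term-document matrix, obtained without re-tokenizing anything.
X_stemmed = S @ X
print(X_stemmed)  # rows now correspond to stems
```

The point is that if you tokenize at the most specific level up front, coarser views (stemmed, uncompounded) remain cheap linear transformations of the stored matrix, whereas the reverse direction is lossy.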
The only serious issue with this is the problem of overlapping compound words.

On Tue, Nov 3, 2009 at 2:39 PM, Ken Krugler <[email protected]> wrote:

> I assume there would also be an issue of which tokenizer to use to create
> the terms from the text.
>
> And possibly issues around storing separate vectors for (at least) title
> vs. content?
>
> Anybody have input on either of these?
