On Fri, Nov 13, 2009 at 10:35 AM, Ken Krugler <[email protected]> wrote:
> Hi Ted,
>
> On Nov 3, 2009, at 6:37pm, Ted Dunning wrote:
>
>> I would opt for the most specific tokenization that is feasible (no
>> stemming, as much compounding as possible).
>
> By "as much compounding as possible", do you mean you want the tokenizer
> to do as much splitting as possible, or as little?

My ultimate preference is to actually glue very common phrases into a
single term. This can be reversed with a linear transformation. It may not
be feasible for a first hack.

> E.g. "super-duper" should be left as-is, or turned into "super" and
> "duper"?

Left as is. And "New York" should be tokenized as a single term. Likewise
with "Staff writer of the Wall Street Journal".

> Is there a particular configuration of Lucene tokenizers that you'd
> suggest?

I am not an expert, but I know of no tokenizers that will do this. Lucene
typically retains positional information, which means that searching for
phrases is relatively cheap (roughly 10x in search time, but most
collections take approximately zero time to search anyway).

Maybe the best answer is to produce two vectorized versions: one heavily
stemmed and split apart (the micro-token approach), and another, the
mega-token version, that we can progressively improve (rough sketch at the
bottom of this mail). The final arbiter should be whoever does the work
(you!).
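For the mega-token side, a ShingleFilter chain is only a crude
approximation of what I have in mind: it glues every pair or triple of
adjacent words into a single term rather than just the very common
phrases, but it is a starting point. Untested sketch, written against a
recent Lucene release (exact package names and constructors differ across
versions):

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class MegaTokenSketch {
  public static void main(String[] args) throws Exception {
    // Basic word splitting, lower-cased, then shingled into 2- and
    // 3-word terms (ShingleFilter also passes the unigrams through).
    Tokenizer source = new StandardTokenizer();
    source.setReader(new StringReader("Staff writer of the Wall Street Journal"));
    TokenStream stream = new LowerCaseFilter(source);
    stream = new ShingleFilter(stream, 2, 3);

    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
      // Prints "staff", "staff writer", ... "wall street journal", etc.
      System.out.println(term);
    }
    stream.end();
    stream.close();
  }
}

The micro-token version would be the same chain with a PorterStemFilter in
place of the ShingleFilter.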
