Thanks for the advice. I want to keep the capitalization because in our application we are mining specific contact and company names from news articles. About 99% of the time if we match a contact or company and it's capitalized we avoid false matches.
--Larry On May 18, 2010, at 7:46 PM, Erick Erickson wrote: > You can construct your own analyzer by creating > it from a pre-existing Tokenizer > (e.g. WhiteSpaceTokenizer) and any number > of TokenfFilters (e.g. TokenFilter). You can > string any number of TokenFilters together > to get many different effects. > > But I have to ask, why you want to keep capitalization? > and punctuation? Do you really want to fail to match > text indexed with "Erickson, Erick" with the query > "erick erickson"? That's often a source of frustration > instead of goodness. > > HTH > Erick > > On Tue, May 18, 2010 at 2:05 PM, Larry Hendrix <lahend...@wisc.edu> wrote: > >> Hi, >> >> Right now I'm using Lucene with a basic Whitespace Anayzer but I'm having >> problems with stemming. Does anyone have a recommendation for other text >> analyzers that handle stemming and also keep capitalization, stop words, and >> punctuation? >> >> Thanks, >> Larry >> >> >> Larry A. Hendrix, Graduate Student >> Computer Science Department >> University of Wisconsin-Madison >> 1300 University Ave Rm 6749 >> Madison, WI 53711 >> Office: (608) 263-7624 >> lhend...@cs.wisc.edu >> Grambling State University Alum >> >> Larry A. Hendrix, Graduate Student Computer Science Department University of Wisconsin-Madison 1300 University Ave Rm 6749 Madison, WI 53711 Office: (608) 263-7624 lhend...@cs.wisc.edu Grambling State University Alum