Re: Stemming Problem

Larry Hendrix Wed, 19 May 2010 19:03:53 -0700

Thanks for the advice. I want to keep the capitalization because in our 
application we are mining specific contact and company names from news 
articles. About 99% of the time if we match a contact or company and it's 
capitalized we avoid false matches.


--Larry

On May 18, 2010, at 7:46 PM, Erick Erickson wrote:

> You can construct your own analyzer by creating
> it from a pre-existing Tokenizer
> (e.g. WhiteSpaceTokenizer) and any number
> of TokenfFilters (e.g. TokenFilter). You can
> string any number of TokenFilters together
> to get many different effects.
> 
> But I have to ask, why you want to keep capitalization?
> and punctuation? Do you really want to fail to match
> text indexed with "Erickson, Erick" with the query
> "erick erickson"? That's often a source of frustration
> instead of goodness.
> 
> HTH
> Erick
> 
> On Tue, May 18, 2010 at 2:05 PM, Larry Hendrix <[email protected]> wrote:
> 
>> Hi,
>> 
>> Right now I'm using Lucene with a basic Whitespace Anayzer but I'm having
>> problems with stemming. Does anyone have a recommendation for other text
>> analyzers that handle stemming and also keep capitalization, stop words, and
>> punctuation?
>> 
>> Thanks,
>> Larry
>> 
>> 
>> Larry A. Hendrix, Graduate Student
>> Computer Science Department
>> University of Wisconsin-Madison
>> 1300 University Ave Rm 6749
>> Madison, WI 53711
>> Office: (608) 263-7624
>> [email protected]
>> Grambling State University Alum
>> 
>> 

Larry A. Hendrix, Graduate Student 
Computer Science Department 
University of Wisconsin-Madison 
1300 University Ave Rm 6749 
Madison, WI 53711 
Office: (608) 263-7624 
[email protected] 
Grambling State University Alum

Re: Stemming Problem

Reply via email to