Re: Basic question about indexing certain words

syedfa Sun, 27 Dec 2009 15:20:17 -0800

Thanks very much for such a detailed reply, I didn't realize that there was
so much to this subject.  I understand the issue a bit better now!


Take care.


Erick Erickson wrote:
> 
> It depends completely on what analyzer you use. Conceptually, an Analyzer
> is composed of a Tokenizer followed by any number of Filters. So the
> input stream is broken up by the Tokenizer, then each token has one or
> more Filters applied (e.g. LowerCaseFilter, StopWordFilter)..
> 
> The reason I'm not answering your question directly is that I can't. If
> you
> choose, say, a WhitespaceAnalyzer, which is built from a
> WhitespaceTokenizer,
> then your hyphens and apostrophes will pass through as-is, and your tokens
> (the minimal searchable unit) will be "Jane" "Doe-Smith" and "Sa'eed",
> capitals
> and all.
> 
> If you choose StandardAnalyzer, built on StandardTokenizer and several
> filters
> your tokens would be "jane" "doe" "smith" "sa" "eed" (note lower-casing as
> well).
> 
> You can build your own Analyzers to process text however you please.
> Lucene
> In Action has quite a thorough explanation of this process, you'll save
> yourself
> a bunch of time by reading those sections. You can get the second edition
> of that book in electronic form from Manning through their early access
> program.
> 
> Until you understand this process well, I'd recommend that you be very,
> very
> sure that you use the *same* analyzer for both indexing and searching or
> your
> results will be...surprising.
> 
> Think about getting a copy of Luke to examine your indexes, that tool
> makes
> it
> easy to see the effects of various Analyzers. Google Lucene Luke....
> 
> Finally, you can easily use *different* analyzers for different fields
> within a
> document, see PerFieldAnalyzerWrapper.
> 
> HTH
> Erick
> 
> On Sun, Dec 27, 2009 at 5:48 PM, syedfa <[email protected]> wrote:
> 
>>
>> Dear fellow Java developers:
>>
>> I have a very basic question about indexing text using Lucene.  I am
>> indexing a large amount of text, that includes names that contain certain
>> punctuation (eg. "Jane Doe-Smith", "Sa'eed", etc.)  Will the punctuation
>> throw off the indexer in any way, such that it breaks up the tokens when
>> they shouldn't be, or will the indexer simply treat the punctuation
>> inside
>> the names as any other character, and the presence of the punctuation
>> will
>> not in any way hinder a user's ability to search for that name?  Are
>> there
>> any precautions that I should take to avoid any problems?
>>
>> I hope this question is clear and makes sense.
>>
>> Thanks in advance to all who reply.
>>
>> --
>> View this message in context:
>> http://old.nabble.com/Basic-question-about-indexing-certain-words-tp26937880p26937880.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
> 
> 

-- 
View this message in context: 
http://old.nabble.com/Basic-question-about-indexing-certain-words-tp26937880p26938117.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: Basic question about indexing certain words

Reply via email to