Thanks very much for such a detailed reply. I didn't realize there was so much to this subject. I understand the issue a bit better now!
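For anyone else reading this thread, the idea in the reply quoted below (a tokenizer followed by filters, with the choice of tokenizer deciding what happens to hyphens and apostrophes) can be sketched in plain Java. This is only a conceptual stand-in and does not use Lucene: the class `AnalyzerSketch` and its methods are made-up names, and the "split on non-letters" step is a rough imitation of the second behaviour described in the reply, not StandardAnalyzer's actual rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Conceptual stand-in only: these are NOT Lucene classes or Lucene's API.
final class AnalyzerSketch {

    // "Tokenizer" stage, variant 1: split on whitespace, leave punctuation alone.
    static List<String> whitespaceTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // "Tokenizer" stage, variant 2 (assumption): split on anything that is
    // not a letter or digit, so hyphens and apostrophes break tokens apart.
    static List<String> letterDigitTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // "Filter" stage: applied to each token after tokenizing.
    static List<String> lowerCaseFilter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Jane Doe-Smith Sa'eed";

        // Whitespace only: punctuation and capitals survive.
        System.out.println(whitespaceTokenize(text));
        // prints [Jane, Doe-Smith, Sa'eed]

        // Split on punctuation, then lower-case.
        List<String> indexed = lowerCaseFilter(letterDigitTokenize(text));
        System.out.println(indexed);
        // prints [jane, doe, smith, sa, eed]

        // Mixing the two: a query token produced by the whitespace variant
        // is not among the terms produced by the other variant.
        String queryToken = whitespaceTokenize("Doe-Smith").get(0);
        System.out.println(indexed.contains(queryToken));
        // prints false
    }
}
```

The last check is the reason for the advice about using the same analyzer for indexing and searching: the query term "Doe-Smith" never matches an index that only contains "doe" and "smith".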
Take care.


Erick Erickson wrote:
>
> It depends completely on what analyzer you use. Conceptually, an Analyzer
> is composed of a Tokenizer followed by any number of Filters. So the
> input stream is broken up by the Tokenizer, then each token has one or
> more Filters applied (e.g. LowerCaseFilter, StopWordFilter).
>
> The reason I'm not answering your question directly is that I can't. If you
> choose, say, a WhitespaceAnalyzer, which is built from a WhitespaceTokenizer,
> then your hyphens and apostrophes will pass through as-is, and your tokens
> (the minimal searchable unit) will be "Jane" "Doe-Smith" and "Sa'eed",
> capitals and all.
>
> If you choose StandardAnalyzer, built on StandardTokenizer and several
> filters, your tokens would be "jane" "doe" "smith" "sa" "eed" (note
> lower-casing as well).
>
> You can build your own Analyzers to process text however you please.
> Lucene In Action has quite a thorough explanation of this process, you'll
> save yourself a bunch of time by reading those sections. You can get the
> second edition of that book in electronic form from Manning through their
> early access program.
>
> Until you understand this process well, I'd recommend that you be very,
> very sure that you use the *same* analyzer for both indexing and searching
> or your results will be...surprising.
>
> Think about getting a copy of Luke to examine your indexes, that tool
> makes it easy to see the effects of various Analyzers. Google Lucene
> Luke....
>
> Finally, you can easily use *different* analyzers for different fields
> within a document, see PerFieldAnalyzerWrapper.
>
> HTH
> Erick
>
> On Sun, Dec 27, 2009 at 5:48 PM, syedfa <fayyazud...@gmail.com> wrote:
>
>>
>> Dear fellow Java developers:
>>
>> I have a very basic question about indexing text using Lucene. I am
>> indexing a large amount of text, that includes names that contain certain
>> punctuation (eg. "Jane Doe-Smith", "Sa'eed", etc.) Will the punctuation
>> throw off the indexer in any way, such that it breaks up the tokens when
>> they shouldn't be, or will the indexer simply treat the punctuation inside
>> the names as any other character, and the presence of the punctuation will
>> not in any way hinder a user's ability to search for that name? Are there
>> any precautions that I should take to avoid any problems?
>>
>> I hope this question is clear and makes sense.
>>
>> Thanks in advance to all who reply.
>>
>> --
>> View this message in context:
>> http://old.nabble.com/Basic-question-about-indexing-certain-words-tp26937880p26937880.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>

--
View this message in context: http://old.nabble.com/Basic-question-about-indexing-certain-words-tp26937880p26938117.html