Thanks for the advice. So far I have been stripping punctuation before the index is built and then stripping it from the query in the same way. I also had to create a separate index for this so that I still have the original information. I think I will take your advice and build a custom TokenFilter that filters out the punctuation at analysis time but keeps the stored contents as in the original.
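For anyone following along, the normalization step I have in mind can be sketched roughly as below. This is plain Java with no Lucene dependency; the class name ProductCodeNormalizer is made up for illustration, and it assumes the whole description code reaches the method as a single string (a whitespace tokenizer would split "21 MA GAB" into three tokens before any filter sees it). In a real analyzer the same logic would presumably live inside a TokenFilter subclass.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of Lucene. */
final class ProductCodeNormalizer {
    private ProductCodeNormalizer() {}

    /** Drops every character that is not a letter or digit, then upper-cases the rest. */
    static String normalize(String code) {
        StringBuilder sb = new StringBuilder(code.length());
        for (int i = 0; i < code.length(); ) {
            int cp = code.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString().toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // All four spellings from the original question collapse to the same form.
        for (String q : new String[] {"21-MA-GAB", "21 MA GAB", "21-MA_GAB", "21MAGAB"}) {
            System.out.println(q + " -> " + normalize(q));
        }
    }
}
```

Running the same normalization on the indexed field and on the incoming query is what makes the variants match; the original text can still be kept in a stored field for display.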
Tom

-----Original Message-----
From: Ken Krugler [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 29, 2005 10:39 AM
To: java-user@lucene.apache.org
Subject: Re: Indexing puncutation

>I do a vaguely similar thing; I have to strip accents from
>characters such as e-acute out of both my input data and my incoming
>search queries to put them into a standard form. I do this with a
>custom TokenFilter subclass. I have an analyzer that includes this
>filter along with some of the standard ones (LowerCaseFilter, etc).
>I run the same analyzer on indexing and searching, which has been
>discussed in other posts.

For a hard-core approach to this problem, you could try converting all
text to Unicode first, then use the ICU package to create a level 0
(primary strength) "sort key" (the C API is ucol_getSortKey). This will
be a string suitable for comparison to determine weak equality, but you
can also just index it as a regular token.

There are some subtle issues with locale-specific behavior of the sort
key generation step, where you could guess at the right locale to use
for the conversion, but in general that shouldn't matter.

Two other issues are code/data size (ICU can be big) and the
performance hit while indexing documents.

-- Ken

>Aigner, Thomas wrote:
>
>>Hello all,
>>
>> I am VERY new to Lucene and we are trying out Lucene to see if
>>it will accomplish the vast majority of our search functions.
>>
>> I have a question about a good way to index some of our product
>>description codes. We have description codes like 21-MA-GAB and other
>>punctuation. Our users need to be able to search for "21 MA GAB"
>>or "21-MA_GAB" or "21MAGAB". Is the best way to accomplish this by
>>creating synonyms for the 3 different ways when punctuation is in parts
>>to search for? I know I can stop punctuation in the index but what about
>>grouping the information together or with spaces?
>>
>>Thanks all in advance,
>>Tom

--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
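
The two normalization ideas in the quoted message (stripping accents, and indexing a primary-strength sort key) can be sketched in plain Java as follows. This is only an illustration under stated assumptions: it uses the JDK's java.text.Normalizer and java.text.Collator as stand-ins rather than ICU itself, the class and method names (FoldingSketch, stripAccents, sortKeyToken) are made up, and the hex encoding of the key bytes is just one possible way to turn the key into an indexable token.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.text.Normalizer;
import java.util.Locale;

/** Hypothetical sketch; uses JDK classes, not ICU or Lucene. */
final class FoldingSketch {
    private FoldingSketch() {}

    /** Decomposes characters (NFD) and removes the combining marks, so e-acute becomes e. */
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    /** Builds a primary-strength collation key and hex-encodes it for use as a token. */
    static String sortKeyToken(String s, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY); // ignore accent and case differences
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        CollationKey key = collator.getCollationKey(s);
        StringBuilder hex = new StringBuilder();
        for (byte b : key.toByteArray()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("caf\u00e9"));
        // Two strings that differ only in accent or case should yield the same token.
        System.out.println(sortKeyToken("caf\u00e9", Locale.ENGLISH)
                .equals(sortKeyToken("CAFE", Locale.ENGLISH)));
    }
}
```

The locale passed to sortKeyToken is the "guess at the right locale" mentioned above; as with any analysis step, the same choice has to be used at index time and at query time.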