keep in mind, you can "store" the raw field for display purposes and "index" many different token sequences that represent the same orriginal data parsed in several ways -- all using the same field name.
: Date: Wed, 29 Jun 2005 13:33:42 -0400 : From: "Aigner, Thomas" <[EMAIL PROTECTED]> : Reply-To: java-user@lucene.apache.org : To: java-user@lucene.apache.org : Subject: RE: Indexing puncutation : : Thanks for the advice. I have replaced punctuation before the index is : built and then queried on the same lack of punctuation. I had to create : a separate index for this as well so I have the original information, : but I think I will take your advice and build a custom token to filter : out the punctuation but keep the contents the original. : : Tom : : -----Original Message----- : From: Ken Krugler [mailto:[EMAIL PROTECTED] : Sent: Wednesday, June 29, 2005 10:39 AM : To: java-user@lucene.apache.org : Subject: Re: Indexing puncutation : : >I do a vaguely similar thing; I have to strip accents from : >characters such as e-acute out of both my input data and my incoming : >search queries to put them into a standard form. I do this with a : >custom TokenFilter subclass. I have an analyzer that includes this : >filter along with some of the standard ones (LowercaseFilter, etc). : >I run the same analyzer on indexing and searching, which has been : >discussed in other posts. : : For a hard-core approach to this problem, you could try converting : all text to Unicode first, then use the ICU package to create a level : 0 "sort key" (the C API is col_getSortKey). This will be a string : suitable for comparison to determine weak equality, but you can also : just index it as a regular token. : : There are some subtle issues w/locale-specific behavior of the sort : key generation step, where you could guess at the right locale to use : for the conversion, but in general that shouldn't matter. : : Two other issues are code/data size (ICU can be big) and the : performance hit while indexing documents. : : -- Ken : : : : >Aigner, Thomas wrote: : > : >>Hello all, : >> : >> I am VERY new to Lucene and we are trying out Lucene to see if : >>it will accomplish the vast majority of our search functions. : >> : >> I have a question about a good way to index some of our product : >>description codes. We have description codes like 21-MA-GAB and other : >>punctuation. Our users need to be able to search for "21 MA GAB" : >>or "21-MA_GAB" or "21MAGAB". Is the best way to accomplish this by : >>creating synonyms for the 3 different ways when punctuation is in : parts : >>to search for? I know I can stop punctuation in the index but what : about : >>grouping the information together or with spaces? : >> : >>Thanks all in advance, : >>Tom : : : -- : Ken Krugler : TransPac Software, Inc. : <http://www.transpac.com> : +1 530-470-9200 : : --------------------------------------------------------------------- : To unsubscribe, e-mail: [EMAIL PROTECTED] : For additional commands, e-mail: [EMAIL PROTECTED] : : : --------------------------------------------------------------------- : To unsubscribe, e-mail: [EMAIL PROTECTED] : For additional commands, e-mail: [EMAIL PROTECTED] : -Hoss --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]