I do a vaguely similar thing; I have to strip accents from
characters such as e-acute out of both my input data and my incoming
search queries to put them into a standard form. I do this with a
custom TokenFilter subclass. I have an analyzer that includes this
filter along with some of the standard ones (LowerCaseFilter, etc.).
I run the same analyzer on indexing and searching, which has been
discussed in other posts.
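In case a concrete sketch helps, here's roughly what such a filter looks like against the current TokenStream API (the class name is made up for illustration, and newer Lucene versions ship ASCIIFoldingFilter, which does much the same thing):

    import java.io.IOException;
    import java.text.Normalizer;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public final class AccentStripFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        public AccentStripFilter(TokenStream in) {
            super(in);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            // NFD decomposition splits e-acute into "e" plus a combining accent;
            // the combining marks (\p{M}) are then dropped.
            String decomposed = Normalizer.normalize(termAtt.toString(), Normalizer.Form.NFD);
            termAtt.setEmpty().append(decomposed.replaceAll("\\p{M}", ""));
            return true;
        }
    }

It slots into the analyzer chain after the tokenizer and LowerCaseFilter, on both the indexing and query side.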
For a hard-core approach to this problem, you could try converting
all text to Unicode first, then use the ICU package to create a level-0
"sort key" (the ICU4C API is ucol_getSortKey). The key is a byte string
suitable for comparison to determine weak equality, but you can also
just index it as a regular token.
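Just to illustrate the idea in Java terms (a sketch, not code from my index): the JDK's java.text.Collator exposes the same collation model, and ICU4J's com.ibm.icu.text.Collator works much the same way. At primary strength, the keys for accented and unaccented forms come out equal:

    import java.text.Collator;
    import java.util.Arrays;
    import java.util.Locale;

    public class SortKeyExample {
        public static void main(String[] args) {
            // PRIMARY strength ignores accent and case differences,
            // so the two keys below compare as weakly equal.
            Collator collator = Collator.getInstance(Locale.US);
            collator.setStrength(Collator.PRIMARY);

            byte[] key1 = collator.getCollationKey("résumé").toByteArray();
            byte[] key2 = collator.getCollationKey("RESUME").toByteArray();

            // The raw bytes can be base64-encoded (or similar) and indexed as a token.
            System.out.println(Arrays.equals(key1, key2));  // prints true
        }
    }

Locale.US there is just a guess, which is exactly the locale question below.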
There are some subtle issues with locale-specific behavior of the sort
key generation step, where you may have to guess at the right locale to
use for the conversion, but in general that shouldn't matter.
Two other issues are code/data size (ICU can be big) and the
performance hit while indexing documents.
-- Ken
Aigner, Thomas wrote:
Hello all,
I am VERY new to Lucene and we are trying out Lucene to see if
it will accomplish the vast majority of our search functions.
I have a question about a good way to index some of our product
description codes. We have description codes like 21-MA-GAB that contain
punctuation. Our users need to be able to search for "21 MA GAB"
or "21-MA_GAB" or "21MAGAB". Is the best way to accomplish this to
create synonyms for the three different forms whenever a code contains
punctuation? I know I can strip punctuation at indexing time, but what
about the forms where the parts are run together or separated by spaces?
Thanks all in advance,
Tom
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200