I do a vaguely similar thing; I have to strip accents from
characters such as e-acute out of both my input data and my incoming
search queries to put them into a standard form. I do this with a
custom TokenFilter subclass. I have an analyzer that includes this
filter along with some of the standard ones (LowerCaseFilter, etc.).
I run the same analyzer on indexing and searching, which has been
discussed in other posts.
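In case a concrete sketch helps, here's roughly what such a filter looks like against the current TokenStream API (the class name is made up for illustration, and newer Lucene versions ship ASCIIFoldingFilter, which does much the same thing):

    import java.io.IOException;
    import java.text.Normalizer;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public final class AccentStripFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        public AccentStripFilter(TokenStream in) {
            super(in);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            // NFD decomposition splits e-acute into "e" plus a combining accent;
            // the combining marks (\p{M}) are then dropped.
            String decomposed = Normalizer.normalize(termAtt.toString(), Normalizer.Form.NFD);
            termAtt.setEmpty().append(decomposed.replaceAll("\\p{M}", ""));
            return true;
        }
    }

It slots into the analyzer chain after the tokenizer and LowerCaseFilter, on both the indexing and query side.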
For a hard-core approach to this problem, you could try converting
all text to Unicode first, then use the ICU package to create a level-0
"sort key" (the ICU4C API is ucol_getSortKey). The key is a byte string
suitable for comparison to determine weak equality, but you can also
just index it as a regular token.
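Just to illustrate the idea in Java terms (a sketch, not code from my index): the JDK's java.text.Collator exposes the same collation model, and ICU4J's com.ibm.icu.text.Collator works much the same way. At primary strength, the keys for accented and unaccented forms come out equal:

    import java.text.Collator;
    import java.util.Arrays;
    import java.util.Locale;

    public class SortKeyExample {
        public static void main(String[] args) {
            // PRIMARY strength ignores accent and case differences,
            // so the two keys below compare as weakly equal.
            Collator collator = Collator.getInstance(Locale.US);
            collator.setStrength(Collator.PRIMARY);

            byte[] key1 = collator.getCollationKey("résumé").toByteArray();
            byte[] key2 = collator.getCollationKey("RESUME").toByteArray();

            // The raw bytes can be base64-encoded (or similar) and indexed as a token.
            System.out.println(Arrays.equals(key1, key2));  // prints true
        }
    }

Locale.US there is just a guess, which is exactly the locale question below.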
There are some subtle issues with locale-specific behavior of the sort
key generation step, where you may have to guess at the right locale to
use for the conversion, but in general that shouldn't matter.
Two other issues are code/data size (ICU can be big) and the
performance hit while indexing documents.
-- Ken
Aigner, Thomas wrote:
Hello all,
I am VERY new to Lucene and we are trying out Lucene to see if
it will accomplish the vast majority of our search functions.
I have a question about a good way to index some of our product
description codes. We have description codes like 21-MA-GAB that contain
punctuation. Our users need to be able to search for "21 MA GAB"
or "21-MA_GAB" or "21MAGAB". Is the best way to accomplish this to
create synonyms for the three different forms whenever a code contains
punctuation? I know I can strip punctuation at indexing time, but what
about the forms where the parts are run together or separated by spaces?
Thanks all in advance,
Tom
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200