RE: Indexing punctuation

Aigner, Thomas Wed, 29 Jun 2005 10:33:57 -0700

Thanks for the advice.  For now I strip the punctuation before the index
is built and then strip it from the query in the same way.  I had to
create a separate index for this so that I still have the original
information, but I think I will take your advice and build a custom
token filter that drops the punctuation while keeping the original
contents.
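
As a rough sketch of the "strip it on both sides" idea in plain Java
(no Lucene classes involved; the class and method names here are made
up for illustration), the same normalization can be applied to the
product code before indexing and to the user's query before searching:

```java
import java.util.Locale;

// Hypothetical helper, not part of Lucene.
final class CodeNormalizer {
    private CodeNormalizer() {}

    // Removes every character that is not a letter or a digit,
    // then lower-cases the result.
    static String normalize(String code) {
        return code.replaceAll("[^\\p{L}\\p{N}]", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String[] variants = {"21-MA-GAB", "21 MA GAB", "21-MA_GAB", "21MAGAB"};
        for (String v : variants) {
            // Every variant is reduced to the same form: 21magab
            System.out.println(v + " -> " + normalize(v));
        }
    }
}
```

Because the indexed value and the query go through the same function,
all four spellings end up as the same term.  The original text still
has to be kept somewhere else (a stored field or a second index) if it
needs to be displayed.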


Tom

-----Original Message-----
From: Ken Krugler [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, June 29, 2005 10:39 AM
To: [email protected]
Subject: Re: Indexing punctuation

>I do a vaguely similar thing;  I have to strip accents from 
>characters such as e-acute out of both my input data and my incoming 
>search queries to put them into a standard form.  I do this with a 
>custom TokenFilter subclass.  I have an analyzer that includes this 
>filter along with some of the standard ones (LowerCaseFilter, etc.). 
>I run the same analyzer on indexing and searching, which has been 
>discussed in other posts.
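
The accent-stripping step described in the quoted message could look
roughly like the following.  This is only a sketch of the string
transformation using the JDK's java.text.Normalizer, not the poster's
actual TokenFilter; the class and method names are hypothetical, and
wiring it into a Lucene filter is left out.

```java
import java.text.Normalizer;

// Hypothetical helper showing only the text transformation.
final class AccentStripper {
    private AccentStripper() {}

    // Decomposes each character into base letter + combining marks (NFD),
    // then deletes the combining marks.
    static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an e-acute; the accent is removed.
        System.out.println(stripAccents("caf\u00e9"));
    }
}
```

The same function would have to run on both the indexed text and the
incoming query for the two to match.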

For a hard-core approach to this problem, you could try converting 
all text to Unicode first, then use the ICU package to create a level 
0 "sort key" (the C API is ucol_getSortKey). This will be a string 
suitable for comparison to determine weak equality, but you can also 
just index it as a regular token.
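
To give a feel for the sort-key idea without pulling in ICU, here is a
small sketch that uses the JDK's own java.text.Collator instead.  It
is a stand-in, not the ICU API: I am assuming that PRIMARY strength is
the JDK counterpart of the weakest comparison level, and the class
name, method name, and hex encoding are made up for illustration.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper; uses the JDK collator, not ICU.
final class SortKeyTokens {
    private SortKeyTokens() {}

    // Builds a primary-strength collation key for the text and returns it
    // as a hex string so it can be indexed like an ordinary token.
    static String sortKeyToken(String text, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);  // ignore accent and case differences
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        byte[] key = collator.getCollationKey(text).toByteArray();
        StringBuilder hex = new StringBuilder(key.length * 2);
        for (byte b : key) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String a = sortKeyToken("r\u00e9sum\u00e9", Locale.ENGLISH);
        String b = sortKeyToken("RESUME", Locale.ENGLISH);
        System.out.println(a.equals(b));
    }
}
```

Two strings that differ only in accents or case get the same token, so
an exact-match term query on that token behaves like a weak-equality
comparison.  How punctuation is weighted depends on the collation rules
in use, so that part should be checked against real data.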

There are some subtle issues with locale-specific behavior in the sort 
key generation step, since you would have to guess at the right locale 
to use for the conversion, but in general that shouldn't matter.

Two other issues are code/data size (ICU can be big) and the 
performance hit while indexing documents.

-- Ken



>Aigner, Thomas wrote:
>
>>Hello all,
>>
>>      I am VERY new to Lucene and we are trying out Lucene to see if
>>it will accomplish the vast majority of our search functions.
>>
>>      I have a question about a good way to index some of our product
>>description codes.  We have description codes like 21-MA-GAB and other
>>punctuation.  Our users need to be able to search for "21 MA GAB" 
>>or "21-MA_GAB" or "21MAGAB".  Is the best way to accomplish this by
>>creating synonyms for the 3 different ways when punctuation is in parts
>>to search for? I know I can stop punctuation in the index but what about
>>grouping the information together or with spaces?
>>
>>Thanks all in advance,
>>Tom


-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


