RE: Indexing punctuation

Chris Hostetter Wed, 29 Jun 2005 11:48:21 -0700


Keep in mind, you can "store" the raw field for display purposes and
"index" many different token sequences that represent the same original
data parsed in several ways, all using the same field name.
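
As a rough illustration of "several ways of parsing the same data", here
is a minimal sketch in plain Java (no Lucene calls) that produces the
token variants for a code like 21-MA-GAB: the individual pieces plus the
run-together form. The class and method names (CodeVariants, indexTokens)
are made up for this example, and lowercasing everything is an assumption.
How the tokens are fed into the index (a custom analyzer, or several
field instances added under one name) is deliberately left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
class CodeVariants {

    // Returns the lowercased pieces of the code followed by the pieces
    // joined together, without duplicates, in a stable order.
    static List<String> indexTokens(String raw) {
        Set<String> tokens = new LinkedHashSet<>();
        StringBuilder joined = new StringBuilder();
        // Split on any run of characters that are neither letters nor digits
        // (spaces, hyphens, underscores, ...).
        for (String part : raw.split("[^\\p{L}\\p{N}]+")) {
            if (part.isEmpty()) {
                continue; // a leading separator yields an empty first piece
            }
            String lower = part.toLowerCase(Locale.ROOT);
            tokens.add(lower);
            joined.append(lower);
        }
        if (joined.length() > 0) {
            tokens.add(joined.toString());
        }
        return new ArrayList<>(tokens);
    }
}
```

With this sketch, "21-MA-GAB" yields the tokens 21, ma, gab and 21magab,
so a query analyzed the same way as "21 MA GAB", "21-MA_GAB" or "21MAGAB"
only produces tokens that are already in that set. The stored value stays
"21-MA-GAB" for display.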


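The accent stripping and sort-key ideas in the quoted thread below both
come down to folding text into one canonical form at index time and at
query time. A minimal sketch of that folding, using only the JDK's
java.text.Normalizer, might look like the following. It is a stand-in for
illustration only: it is neither a Lucene TokenFilter nor an ICU sort key,
and the name FoldedKey.fold is hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper: folds a string to a lowercase key with accents
// and punctuation removed.
class FoldedKey {

    static String fold(String text) {
        // Decompose accented characters into base letter + combining mark.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // Drop the combining marks, leaving the base letters.
        String noMarks = decomposed.replaceAll("\\p{M}+", "");
        // Drop everything that is not a letter or digit (this also
        // removes whitespace).
        String lettersAndDigits = noMarks.replaceAll("[^\\p{L}\\p{N}]+", "");
        return lettersAndDigits.toLowerCase(Locale.ROOT);
    }
}
```

Under these assumptions, fold("21-MA_GAB") and fold("21 MA GAB") give the
same key, and an e-acute folds to a plain "e". The same function has to
be applied to both the indexed data and the incoming query.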

: Date: Wed, 29 Jun 2005 13:33:42 -0400
: From: "Aigner, Thomas" <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: RE: Indexing punctuation
:
: Thanks for the advice.  I have replaced punctuation before the index is
: built and then queried on the same lack of punctuation.  I had to create
: a separate index for this as well so I have the original information,
: but I think I will take your advice and build a custom token filter to
: strip out the punctuation but keep the original contents.
:
: Tom
:
: -----Original Message-----
: From: Ken Krugler [mailto:[EMAIL PROTECTED]
: Sent: Wednesday, June 29, 2005 10:39 AM
: To: java-user@lucene.apache.org
: Subject: Re: Indexing punctuation
:
: >I do a vaguely similar thing;  I have to strip accents from
: >characters such as e-acute out of both my input data and my incoming
: >search queries to put them into a standard form.  I do this with a
: >custom TokenFilter subclass.  I have an analyzer that includes this
: >filter along with some of the standard ones (LowerCaseFilter, etc).
: >I run the same analyzer on indexing and searching, which has been
: >discussed in other posts.
:
: For a hard-core approach to this problem, you could try converting
: all text to Unicode first, then use the ICU package to create a level
: 0 "sort key" (the C API is ucol_getSortKey). This will be a string
: suitable for comparison to determine weak equality, but you can also
: just index it as a regular token.
:
: There are some subtle issues w/locale-specific behavior of the sort
: key generation step, where you could guess at the right locale to use
: for the conversion, but in general that shouldn't matter.
:
: Two other issues are code/data size (ICU can be big) and the
: performance hit while indexing documents.
:
: -- Ken
:
:
:
: >Aigner, Thomas wrote:
: >
: >>Hello all,
: >>
: >>    I am VERY new to Lucene and we are trying out Lucene to see if
: >>it will accomplish the vast majority of our search functions.
: >>
: >>    I have a question about a good way to index some of our product
: >>description codes.  We have description codes like 21-MA-GAB and other
: >>punctuation.  Our users need to be able to search for "21 MA GAB"
: >>or "21-MA_GAB" or "21MAGAB".  Is the best way to accomplish this by
: >>creating synonyms for the 3 different ways when punctuation is in parts
: >>to search for? I know I can stop punctuation in the index but what about
: >>grouping the information together or with spaces?
: >>
: >>Thanks all in advance,
: >>Tom
:
:
: --
: Ken Krugler
: TransPac Software, Inc.
: <http://www.transpac.com>
: +1 530-470-9200
:
: ---------------------------------------------------------------------
: To unsubscribe, e-mail: [EMAIL PROTECTED]
: For additional commands, e-mail: [EMAIL PROTECTED]
:



-Hoss


