Re: Performance implications of unanalyzed content

Erik Hatcher Fri, 16 Apr 2004 03:06:10 -0700

On Apr 16, 2004, at 2:59 AM, Magnus Johansson wrote:

> Hi
>
> I'm developing an application using Lucene where I need to
> be able to both search using a stemmer and sometimes using
> "exact" search.
>
> I see two ways of doing this:
>
> 1. Use two indexes. One using a stemming analyzer and one using
>    a SimpleAnalyzer
>
> 2. Using duplicate fields. One field with stemmed content and
>    one with unstemmed content. (Perhaps the field CONTENT, will be
>    CONTENT and CONTENT_RAW)
>
> I'm leaning towards option 2. However I'm interested in any performance
> implications. If I understand it correctly Lucene keeps separate
> term-dictionaries for each field. So besides the index growing larger
> (which might affect caching) it won't be any slower searching the index
> with duplicate fields when I only query on the CONTENT field
>
> Is this correct?

I wouldn't worry about performance at this stage. Granted, here in Lucene Land performance is key, but Lucene will be plenty fast in either of these scenarios. You say "sometimes" for toggling between exact and stemmed searching. If your requirement were "always" both, you could use another option: have a custom analyzer place the stemmed and exact terms at the same term position (set the position increment to zero for the stemmed terms).
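
To make the same-position idea concrete, here is a rough sketch in plain Java. It is a toy model of a token stream, not the Lucene API: the StackedTokens class, its Token type, and the toyStem helper are all made up for illustration, and toyStem merely strips a trailing "ing" or "s" where a real stemmer would go. Each word is emitted as typed with an increment of 1, and its stem (when different) follows with an increment of 0, so both land on the same position.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a token stream; this is NOT the Lucene API.
final class StackedTokens {

    // A term plus its distance from the previous token's position.
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Hypothetical stand-in for a real stemmer: strips a trailing "ing" or "s".
    static String toyStem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Emits the exact term with increment 1, then its stem (if different)
    // with increment 0 so that both terms share one position.
    static List<Token> analyze(String text) {
        List<Token> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase().split("\\W+")) {
            if (raw.isEmpty()) {
                continue;
            }
            tokens.add(new Token(raw, 1));
            String stem = toyStem(raw);
            if (!stem.equals(raw)) {
                tokens.add(new Token(stem, 0));
            }
        }
        return tokens;
    }

    // Turns increments into absolute positions, formatted as "term@position".
    static List<String> positions(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        int position = -1;
        for (Token t : tokens) {
            position += t.positionIncrement;
            out.add(t.term + "@" + position);
        }
        return out;
    }
}
```

Because the exact and stemmed terms share a position, a phrase match would line up whichever form the query uses. The flip side is that a single field built this way always matches both forms, which is why it only fits the "always both" case.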

But since you need to toggle between exact and stemmed, I'd opt for #2 as well.
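
For what it's worth, option #2 can be pictured with another small plain-Java sketch. Again, this is a toy model and not Lucene code: TwoFieldIndex and toyStem are hypothetical names, and the per-field maps only approximate the idea that each field keeps its own terms. The same text goes into CONTENT (stemmed) and CONTENT_RAW (as typed), and a search picks one field, so a query against CONTENT never looks at the CONTENT_RAW terms.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy per-field inverted index; this is NOT the Lucene API.
final class TwoFieldIndex {

    // field name -> (term -> ids of the documents containing that term)
    private final Map<String, Map<String, Set<Integer>>> fields = new HashMap<>();

    // Hypothetical stand-in for a real stemmer: strips a trailing "ing" or "s".
    static String toyStem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Indexes the same text twice: stemmed in CONTENT, as typed in CONTENT_RAW.
    void add(int docId, String text) {
        for (String raw : text.toLowerCase().split("\\W+")) {
            if (raw.isEmpty()) {
                continue;
            }
            post("CONTENT", toyStem(raw), docId);
            post("CONTENT_RAW", raw, docId);
        }
    }

    private void post(String field, String term, int docId) {
        fields.computeIfAbsent(field, f -> new TreeMap<>())
              .computeIfAbsent(term, t -> new TreeSet<>())
              .add(docId);
    }

    // Looks up a single word in exactly one field, chosen by the exact flag.
    Set<Integer> search(String word, boolean exact) {
        String field = exact ? "CONTENT_RAW" : "CONTENT";
        String lower = word.toLowerCase();
        String term = exact ? lower : toyStem(lower);
        return fields.getOrDefault(field, Map.of()).getOrDefault(term, Set.of());
    }
}
```

With document 1 containing "Searching documents" and document 2 containing "search document", a stemmed search for "searching" finds both, while an exact search for it finds only document 1. In a real index you would also need to analyze each field differently at indexing time; Lucene's PerFieldAnalyzerWrapper looks like a natural fit for that, though I haven't shown it here.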

Erik


