Hi
I'm developing an application using Lucene where I need to be able to both search using a stemmer and sometimes using "exact" search.
I see two ways of doing this:
1. Use two indexes: one built with a stemming analyzer and one built with SimpleAnalyzer.
2. Use duplicate fields: one field with stemmed content and one with unstemmed content. (For example, the field CONTENT would become CONTENT and CONTENT_RAW.)
I'm leaning towards option 2, but I'm interested in any performance implications. If I understand correctly, Lucene keeps a separate term dictionary for each field. So apart from the index growing larger (which might affect caching), searching an index with duplicate fields should be no slower when I only query the CONTENT field.
Is this correct?
I wouldn't worry about performance at this stage. Granted, here in Lucene Land performance is key, but Lucene will be plenty fast in either of these scenarios. You say "sometimes" for toggling between exact and stemmed. If your requirement were that a search "always" matches both, you could use another option: have a custom analyzer place the stemmed and exact terms at the same term position (set the position increment to zero for the stemmed words).
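To make the same-position idea concrete, here is a minimal sketch in plain Java. It is a toy model and not Lucene's API: the `Token` class, `toyStem`, and `analyze` are hypothetical names made up for illustration, and the "stemmer" just strips a trailing "ing" or "s". In Lucene itself the equivalent step is calling `setPositionIncrement(0)` on the stemmed token inside a custom TokenFilter.

```java
import java.util.ArrayList;
import java.util.List;

public class SamePositionTokens {

    /** A term plus its position increment relative to the previously emitted token. */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    /** Hypothetical stand-in for a real stemmer: strips a trailing "ing" or "s". */
    static String toyStem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    /** Emits each exact term, followed by its stem at the same position when the two differ. */
    static List<Token> analyze(String text) {
        List<Token> tokens = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            tokens.add(new Token(word, 1));      // exact term advances the position by one
            String stem = toyStem(word);
            if (!stem.equals(word)) {
                tokens.add(new Token(stem, 0));  // increment 0: stem shares the exact term's position
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Running dogs bark"));
    }
}
```

Because both forms sit at the same position in a single field, phrase queries keep working, but every search matches both the exact and the stemmed form. That is why this only fits the "always both" case.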
But since you need to toggle between exact and stemmed, I'd opt for #2 as well.
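A rough sketch of what #2 amounts to, again as a toy model in plain Java rather than Lucene code. The class `TwoFieldIndex`, its methods, and `toyStem` are all hypothetical; the only names taken from this thread are the fields CONTENT and CONTENT_RAW. Each field gets its own term dictionary, and a search touches only the dictionary of the field it names.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TwoFieldIndex {

    // field name -> (term -> ids of documents containing that term)
    private final Map<String, Map<String, Set<Integer>>> fields = new HashMap<>();

    /** Hypothetical stand-in for a real stemmer: strips a trailing "s". */
    static String toyStem(String word) {
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    /** Indexes the same text twice: stemmed into CONTENT, unstemmed into CONTENT_RAW. */
    void addDocument(int docId, String text) {
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            post("CONTENT", toyStem(word), docId);
            post("CONTENT_RAW", word, docId);
        }
    }

    private void post(String field, String term, int docId) {
        fields.computeIfAbsent(field, f -> new HashMap<>())
              .computeIfAbsent(term, t -> new TreeSet<>())
              .add(docId);
    }

    /** Looks up only the chosen field's dictionary; the other field is never consulted. */
    Set<Integer> search(String word, boolean exact) {
        String lower = word.toLowerCase();
        String field = exact ? "CONTENT_RAW" : "CONTENT";
        String term = exact ? lower : toyStem(lower);  // the query gets the same analysis as the field
        Map<String, Set<Integer>> dictionary = fields.get(field);
        if (dictionary == null) {
            return Collections.emptySet();
        }
        Set<Integer> hits = dictionary.get(term);
        return hits == null ? Collections.<Integer>emptySet() : hits;
    }

    public static void main(String[] args) {
        TwoFieldIndex index = new TwoFieldIndex();
        index.addDocument(1, "dogs bark");
        index.addDocument(2, "the dog barks");
        System.out.println("stemmed 'dog': " + index.search("dog", false));
        System.out.println("exact 'dog':   " + index.search("dog", true));
    }
}
```

The point to carry over to the real thing is that the query term has to go through the same analysis as the field it targets: stemmed for CONTENT, unstemmed for CONTENT_RAW.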
Erik