bhecht <[EMAIL PROTECTED]> wrote on 07/05/2007 10:26:27:

> I have implemented my own analyzer for each country.
> So as I see it, when I index these records, I want to
> provide lucene, with a specific analyzer per record
> i'm indexing.
>
> When a user performs a query in my JSF form, I will
> use the country value he entered, to get the needed
> analyzer, and query lucene with the users query and
> the needed analyzer.
>
> The user may also choose not to enter a country value
> to his search, and here comes in the solution you gave
> me, to duplicate each field, and index it using a non
> stemming analyzer (A standard analyzer without stop
> words defined).
>
> Am I going the right direction?

Sounds OK to me, except that there seems to be a mix-up between
stemming and stop-word elimination. Perhaps it is just a typo in the
text above, but anyhow: while the StandardAnalyzer constructor takes a
stop-word list parameter and would eliminate those words (e.g. "is"),
it would not do stemming (e.g. "knives" --> "knive"). (Though both a
stop list and a stemming algorithm are language specific.)

So, rephrasing the discussion so far, assuming:
1) a single field "F" (for simplicity),
2) the (doc) language is always known at indexing,
3) the (user) language is sometimes known at search,

I think a reasonable solution might be:
1) Use PerFieldAnalyzerWrapper.
2) Index each doc to F and to F_LAN.
3) F would be language neutral: no stemming and no stop-word
   elimination.
4) F_LAN (e.g. F_en) would be language specific, so a
   language-specific stop-word list and a language-specific
   stemmer would be used.
5) Search would go to F_LAN when the language is known and to F
   when it is not, in each case using the same analysis that was
   applied to that field at indexing.
6) Note Karl's mention of searching both F and F_LAN, assigning a
   higher boost to F_LAN. This is useful when there is some
   uncertainty about the "marked language".

There can be other considerations, for instance (1) the certainty of
the language identification; (2) falling back to English when the
language is unknown...

Note that SnowballFilter can be used for applying stemming on the
output of StandardAnalyzer.

Doron
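
P.S. The field routing in points 2 and 5 above (index each doc to both F and F_LAN, search F_LAN only when the language is known) can be sketched in plain Java. This is only an illustration of the naming and fallback logic, not Lucene code: the class FieldRouter and its methods are hypothetical names, and the lowercase suffix simply follows the "F_en" example. Wiring the resulting field names to analyzers (e.g. through PerFieldAnalyzerWrapper) is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: decides which field names to use for a document
// or a query. It does not call any Lucene API.
public class FieldRouter {

    // Builds the language-specific field name, e.g. ("F", "en") -> "F_en".
    public static String languageField(String baseField, String lang) {
        return baseField + "_" + lang.trim().toLowerCase(Locale.ROOT);
    }

    // At indexing the document language is assumed to be known, so every
    // document goes to both the neutral field and the language field.
    public static List<String> indexFields(String baseField, String docLang) {
        List<String> fields = new ArrayList<>();
        fields.add(baseField);                          // F: no stemming, no stop words
        fields.add(languageField(baseField, docLang));  // F_LAN: language-specific analysis
        return fields;
    }

    // At search the user language may be missing; fall back to the
    // neutral field in that case.
    public static String searchField(String baseField, String userLang) {
        if (userLang == null || userLang.trim().isEmpty()) {
            return baseField;
        }
        return languageField(baseField, userLang);
    }

    public static void main(String[] args) {
        System.out.println(indexFields("F", "en"));  // prints [F, F_en]
        System.out.println(searchField("F", "en"));  // prints F_en
        System.out.println(searchField("F", null));  // prints F
    }
}
```

Whatever analyzer is registered for a given field name at indexing would need to be the same one used when that field is searched.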