Greetings. Using the Lucene demo, I set up a JSP front end to an index of HTML documentation, as a full-text search (FTS) prototype for a publications department. It has worked well and we would like to develop it further. There appears, however, to be one potential hitch -- language support. Our documentation will eventually be localized into several languages (not yet specified). I'm not a programmer, but I have some understanding of what a search engine has to do to support a language. I'd like to determine how much language support you get with the standard tokenizer/analyzer (I know that a German analyzer is included with the core API).
1) The FAQ states: "You can use Lucene for documents in any language. However, if you want to use a stemmer you will have to write one yourself. Also, for some languages, Lucene's tokenizing rules are inappropriate, so you may need to write your own tokenizer as well." Does this mean that the standard tokenizer/analyzer would suffice for some European languages? And if so, which ones?

2) For Japanese or Chinese, Lucene supports the character sets. But could you use the default tokenizer and analyzer for those languages and expect decent results?

Thanks,
Frank Eichenlaub
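
To make question 1 concrete, here is a rough sketch of what I mean by picking an analyzer per language. It assumes only the StandardAnalyzer and GermanAnalyzer classes that ship with the core API; the helper class and its method are made up for illustration, and package names may differ depending on the Lucene version:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.de.GermanAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    // Hypothetical helper: choose an analyzer from a document's language code.
    // Only StandardAnalyzer and GermanAnalyzer come with the core API; any
    // other language would need its own analyzer (and possibly tokenizer).
    public class AnalyzerChooser {
        public static Analyzer forLanguage(String isoCode) {
            if ("de".equals(isoCode)) {
                return new GermanAnalyzer();  // German stemming + stop words
            }
            // Fallback: splits on whitespace/punctuation and lowercases;
            // no stemming for anything but German.
            return new StandardAnalyzer();
        }
    }

Question 1 then amounts to: for which European languages is that StandardAnalyzer fallback good enough on its own?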
