Chris, I don't have the issue number here, but look in Lucene's JIRA and search for... ah, here:
https://issues.apache.org/jira/browse/LUCENE-1166

And for Chinese:
https://issues.apache.org/jira/browse/LUCENE-1629

If you happen to be using Solr:
http://www.sematext.com/product-multilingual-analyzer.html

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Chris Collins <chris_j_coll...@yahoo.com>
> To: general@lucene.apache.org
> Sent: Monday, May 11, 2009 11:28:06 AM
> Subject: Re: what if my database data contains other language (like danish, german).
>
> Is anyone aware of either of the two things:
>
> 1) ability to plugin an external source for DF, this would allow you to
> circumvent the problem you mentioned below. (Of course you would have to
> compute a df set for each language you care to have meaningful weights for).
> 2) any open source segmenters, primarily for german, but also for CJK at a
> longshot :-}
>
> Thanks
>
> C
>
> On May 11, 2009, at 8:13 AM, Ted Dunning wrote:
>
> > Yes. Lucene can handle that. You have to select which stemmer to use. You
> > may have to improve the German and Danish stemmers a little bit.
> >
> > You may also have some issues with the fact that if Danish is 5% of your
> > corpus, then words that occur in 100% of your Danish documents will tend to
> > have too high weights since they only occur in 5% of your documents. Any
> > term that occurs in more than 20% of a sub-corpus should generally be
> > discarded from your query. This can be difficult in multi-lingual
> > situations.
> >
> > For a first pass, I would ignore this issue, however.
> >
> > On Mon, May 11, 2009 at 4:07 AM, uday kumar maddigatla wrote:
> >
> >> what if my database data contains other language (like danish, german).
> >>
> >> Is Lucene will handle that .
> >>
> >> If yes How?
> >>
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
> > 111 West Evelyn Ave. Ste. 202
> > Sunnyvale, CA 94086
> > www.deepdyve.com
> > 858-414-0013 (m)
> > 408-773-0220 (fax)
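
To illustrate the per-language DF point raised in the quoted thread, here is a minimal sketch in plain Python. It is not Lucene code and uses no Lucene API. The function names (build_df, df_ratio, filter_query), the toy corpus, and the assumption that every document already carries a language tag are all made up for the example. It keeps one document-frequency table per language and drops query terms that occur in more than 20% of that language's sub-corpus, as Ted suggests.

```python
from collections import Counter, defaultdict


def build_df(docs):
    """Build per-language document counts and document frequencies.

    docs is an iterable of (language, tokens) pairs; the language tag is
    assumed to be known already (hypothetical input format).
    """
    n_docs = Counter()
    df = defaultdict(Counter)
    for lang, tokens in docs:
        n_docs[lang] += 1
        df[lang].update(set(tokens))  # count each term once per document
    return n_docs, df


def df_ratio(term, lang, n_docs, df):
    """Fraction of documents in the given language that contain the term."""
    if n_docs[lang] == 0:
        return 0.0
    return df[lang][term] / n_docs[lang]


def filter_query(terms, lang, n_docs, df, max_ratio=0.20):
    """Keep only terms at or below max_ratio within the language sub-corpus."""
    return [t for t in terms if df_ratio(t, lang, n_docs, df) <= max_ratio]


# Toy corpus: Danish is 5% of the documents (10 of 200).
docs = [("en", ["the", f"word{i}"]) for i in range(190)]
docs += [("da", ["og", f"ord{i}"]) for i in range(10)]
n_docs, df = build_df(docs)

# "og" is in every Danish document but in only 5% of the whole corpus,
# so a corpus-wide DF would make it look like a rare, informative term.
global_ratio = sum(df[lang]["og"] for lang in df) / sum(n_docs.values())
danish_ratio = df_ratio("og", "da", n_docs, df)

# With the per-language table, "og" is discarded and "ord3" survives.
kept = filter_query(["og", "ord3"], "da", n_docs, df)
```

The same per-language tables could in principle be fed to whatever scoring you use in place of corpus-wide counts, but how to plug that into Lucene itself is not covered here.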