Thanks Otis, I will take a look.

Best

C
On May 17, 2009, at 7:05 PM, Otis Gospodnetic wrote:


Chris,

I don't have the issue number here, but look in Lucene's JIRA and search for... ah, here:

 https://issues.apache.org/jira/browse/LUCENE-1166


And for Chinese:

 https://issues.apache.org/jira/browse/LUCENE-1629

If you happen to be using Solr:

 http://www.sematext.com/product-multilingual-analyzer.html

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
From: Chris Collins <chris_j_coll...@yahoo.com>
To: general@lucene.apache.org
Sent: Monday, May 11, 2009 11:28:06 AM
Subject: Re: what if my database data contains other language (like danish, german).

Is anyone aware of either of these two things:

1) The ability to plug in an external source for DF. This would let you circumvent the problem you mention below. (Of course, you would have to compute a DF set for each language you want meaningful weights for.)

2) Any open source segmenters, primarily for German, but also for CJK as a long shot :-}
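
To make item 1 concrete, here is a rough sketch of a per-language document-frequency table kept outside the index and consulted at scoring time. It is plain Python, not Lucene API. `ExternalDf`, its method names, and the sample tokens are all hypothetical, and the smoothed formula log(N / (df + 1)) + 1 is an assumption on my part rather than something confirmed in this thread.

```python
import math
from collections import Counter, defaultdict


class ExternalDf:
    """Hypothetical per-language document-frequency store (not a Lucene class)."""

    def __init__(self):
        self.doc_counts = Counter()      # language -> number of documents seen
        self.df = defaultdict(Counter)   # language -> term -> document frequency

    def add_document(self, lang, tokens):
        """Count each distinct term once for this document."""
        self.doc_counts[lang] += 1
        for term in set(tokens):
            self.df[lang][term] += 1

    def idf(self, lang, term):
        """Smoothed IDF computed against the term's own language only."""
        n = self.doc_counts[lang]
        if n == 0:
            return 0.0  # assumption: no statistics for this language, no weight
        return math.log(n / (self.df[lang][term] + 1)) + 1.0


store = ExternalDf()
store.add_document("da", ["og", "hund"])
store.add_document("da", ["og", "kat"])
store.add_document("de", ["und", "hund"])

# "og" appears in every Danish document, so it should weigh less than "kat".
print(store.idf("da", "og") < store.idf("da", "kat"))
```

The point of keying on language is that a term is only ever compared against documents in its own language, so a small language is not penalized or rewarded for being a small share of the whole corpus.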

Thanks

C

On May 11, 2009, at 8:13 AM, Ted Dunning wrote:

Yes. Lucene can handle that. You have to select which stemmer to use. You
may have to improve the German and Danish stemmers a little bit.

You may also run into a weighting problem: if Danish is 5% of your corpus, then words that occur in 100% of your Danish documents will tend to get weights that are too high, since they occur in only 5% of all your documents. Any term that occurs in more than 20% of a sub-corpus should generally be discarded from your query. This can be difficult in multilingual situations.

For a first pass, I would ignore this issue, however.
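
The arithmetic behind the weighting concern can be sketched as follows. The corpus sizes are made-up numbers chosen to match the 5% figure, the plain log(N / df) form of IDF is an assumption (Lucene's own formula differs in detail), and `too_common` is a hypothetical helper for the 20% rule of thumb.

```python
import math


def idf(num_docs, doc_freq):
    """Plain inverse document frequency: log(N / df)."""
    return math.log(num_docs / doc_freq)


def too_common(doc_freq, sub_corpus_docs, threshold=0.20):
    """True if the term occurs in more than `threshold` of the sub-corpus."""
    return doc_freq / sub_corpus_docs > threshold


total_docs = 100_000   # assumed size of the whole mixed-language corpus
danish_docs = 5_000    # Danish is 5% of it
df_og = 5_000          # "og" (Danish "and") occurs in every Danish document

idf_global = idf(total_docs, df_og)    # log(20), roughly 3.0: looks like a rare term
idf_danish = idf(danish_docs, df_og)   # log(1) = 0.0: carries no information

print(round(idf_global, 2), round(idf_danish, 2))
print(too_common(df_og, danish_docs))   # 100% of the Danish sub-corpus
print(too_common(df_og, total_docs))    # only 5% of the whole corpus
```

Measured against the whole corpus, the Danish stop word looks rare and slips under the 20% cutoff. Measured against the Danish sub-corpus, it is clearly a term to drop.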

On Mon, May 11, 2009 at 4:07 AM, uday kumar maddigatla wrote:

What if my database data contains other languages (like Danish, German)?

Will Lucene handle that?

If yes, how?




--Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)

