of the
abbreviations)
Regards
Uwe Goetzke
-Original Message-
From: Cedric Ho [mailto:[EMAIL PROTECTED]
Sent: Saturday, 10 November 2007 02:28
To: java-user@lucene.apache.org
Subject: - Re: Chinese Segmentation with Phase Query
On Nov 10, 2007 2:08 AM, Steven A Rowe [EMAIL PROTECTED] wrote:
Hi Cedric,
On 11/08/2007, Cedric Ho wrote:
a sentence containing the characters ABC may be segmented into AB, C or
A, BC.
[snip]
In these cases we would like
Hi Cedric
You may try the CJKAnalyzer in the Lucene sandbox. It isn't
a perfect solution for Chinese word segmentation, but it will solve the
problem in your case.
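
For readers who have not used it: CJKAnalyzer indexes CJK text as overlapping two-character tokens (bigrams) instead of dictionary words. The following is only a rough Python sketch of that idea, not Lucene code. The function name cjk_bigrams is made up for illustration, the handling of a single leftover character is an assumption, and the inclusive offsets follow the notation used in this thread.

```python
def cjk_bigrams(text):
    """Return overlapping two-character tokens as (term, start, end) tuples.

    start and end are inclusive character indexes into text.
    """
    if len(text) < 2:
        # Assumption for this sketch: a lone character is kept as its own token.
        return [(text, 0, 0)] if text else []
    return [(text[i:i + 2], i, i + 1) for i in range(len(text) - 1)]


# "ABC" stands in for three Chinese characters, as in the thread.
print(cjk_bigrams("ABC"))  # [('AB', 0, 1), ('BC', 1, 2)]
```

Because every adjacent pair of characters becomes a token, no segmentation decision is needed at index time, at the cost of indexing pairs that are not real words.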
On Nov 9, 2007 10:59 AM, Cedric Ho [EMAIL PROTECTED] wrote:
Hi,
We are having an issue while indexing Chinese Documents
Hi Cedric,
On 11/08/2007, Cedric Ho wrote:
a sentence containing the characters ABC may be segmented into AB, C or A, BC.
[snip]
In these cases we would like to add both segmentations to the index:
AB offset (0,1) position 0    A offset (0,0) position 0
C offset (2,2) position 1
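
To make the offset/position notation above concrete, here is a small Python sketch (not Lucene code) that derives the tokens for both segmentations of ABC and merges them into one list. The names Token, tokens_for and merge are hypothetical, and the rule "position = index of the word within its own segmentation" is an assumption read off the example above.

```python
from collections import namedtuple

# start/end are inclusive character offsets, as in the thread's notation.
Token = namedtuple("Token", "term start end position")


def tokens_for(segmentation):
    """Turn one segmentation (a list of words) into Token tuples."""
    tokens, offset = [], 0
    for position, word in enumerate(segmentation):
        tokens.append(Token(word, offset, offset + len(word) - 1, position))
        offset += len(word)
    return tokens


def merge(segmentations):
    """Merge the tokens of several segmentations, dropping exact duplicates."""
    merged = {tok for seg in segmentations for tok in tokens_for(seg)}
    return sorted(merged, key=lambda t: (t.position, t.start, t.end))


for tok in merge([["AB", "C"], ["A", "BC"]]):
    print(tok.term, "offset", (tok.start, tok.end), "position", tok.position)
```

In Lucene terms, a token that shares the position of the previous token would carry a position increment of 0. One thing to watch with this simple numbering: A sits at position 0 and C at position 1, so a phrase query for "A C" could match even though B lies between them in the text.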
The CJKAnalyzer is too simple for our needs, but thanks for the suggestion anyway.
Cheers,
Cedric
On Nov 9, 2007 10:43 PM, Open Study [EMAIL PROTECTED] wrote:
Hi Cedric
You may try the CJKAnalyzer in the Lucene sandbox. It isn't
a perfect solution for Chinese word segmentation, but
On Nov 10, 2007 2:08 AM, Steven A Rowe [EMAIL PROTECTED] wrote:
Hi Cedric,
On 11/08/2007, Cedric Ho wrote:
a sentence containing the characters ABC may be segmented into AB, C or A,
BC.
[snip]
In these cases we would like to add both segmentations to the index:
AB offset (0,1)
Hi,
We are having an issue while indexing Chinese Documents in Lucene.
Some background first:
Since CJK languages don't put spaces between words, we first have to
determine the words in each sentence, e.g.
a sentence containing the characters ABC may be segmented into AB, C or A, BC.
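
As an illustration of the ambiguity, the Python sketch below lists every way a string can be split into known words. The thread does not say how the segmentation is actually done, so the dictionary-based approach, the function name segmentations and the word list are all assumptions made for this example.

```python
def segmentations(text, vocabulary):
    """Return every way to split text into words drawn from vocabulary."""
    if not text:
        return [[]]  # one way to segment nothing: the empty segmentation
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in vocabulary:
            # Keep this word and segment whatever is left after it.
            for rest in segmentations(text[end:], vocabulary):
                results.append([word] + rest)
    return results


# Hypothetical word list; A, B and C stand in for Chinese characters.
vocabulary = {"A", "AB", "BC", "C"}
print(segmentations("ABC", vocabulary))  # [['A', 'BC'], ['AB', 'C']]
```
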
the