Re: Chinese Segmentation with Phase Query

2007-11-10 Thread Uwe Goetzke
of the abbreviations) Regards Uwe Goetzke -Original Message- From: Cedric Ho [mailto:[EMAIL PROTECTED] Sent: Saturday, 10 November 2007 02:28 To: java-user@lucene.apache.org Subject: - Re: Chinese Segmentation with Phase Query On Nov 10, 2007 2:08 AM, Steven A Rowe [EMAIL PROTECTED] wrote

Re: Chinese Segmentation with Phase Query

2007-11-10 Thread Cedric Ho
Subject: - Re: Chinese Segmentation with Phase Query On Nov 10, 2007 2:08 AM, Steven A Rowe [EMAIL PROTECTED] wrote: Hi Cedric, On 11/08/2007, Cedric Ho wrote: a sentence containing the characters ABC may be segmented into AB, C or A, BC. [snip] In this case we would like

Re: Chinese Segmentation with Phase Query

2007-11-09 Thread Open Study
Hi Cedric, You may try the CJKAnalyzer in the Lucene sandbox. It doesn't give a perfect solution for Chinese word segmentation, but it will solve the problem in your case. On Nov 9, 2007 10:59 AM, Cedric Ho [EMAIL PROTECTED] wrote: Hi, We are having an issue while indexing Chinese documents

RE: Chinese Segmentation with Phase Query

2007-11-09 Thread Steven A Rowe
Hi Cedric, On 11/08/2007, Cedric Ho wrote: a sentence containing the characters ABC may be segmented into AB, C or A, BC. [snip] In this case we would like to index both segmentations into the index: AB offset (0,1) position 0; A offset (0,0) position 0; C offset (2,2) position 1
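
The layout quoted above (two segmentations of ABC stored side by side, each token carrying character offsets and a position) can be sketched in a few lines. This is only an illustration under stated assumptions, not code from the thread and not the Lucene API: the names `Token`, `tokens_for`, `merge_segmentations` and `position_increments` are hypothetical, end offsets are inclusive as in the quoted message, and a token's position is taken to be its index within its own segmentation. Tokens that share a position get an increment of 0, which is how Lucene's position-increment convention represents stacked tokens.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    term: str
    start: int     # offset of the first character
    end: int       # offset of the last character (inclusive, as in the message)
    position: int  # index of the token within its own segmentation (assumption)


def tokens_for(segmentation: List[str]) -> List[Token]:
    """Assign offsets and positions to the words of one segmentation."""
    tokens = []
    offset = 0
    for position, term in enumerate(segmentation):
        tokens.append(Token(term, offset, offset + len(term) - 1, position))
        offset += len(term)
    return tokens


def merge_segmentations(segmentations: List[List[str]]) -> List[Token]:
    """Combine several segmentations into one stream ordered by position."""
    merged = {tok for seg in segmentations for tok in tokens_for(seg)}
    return sorted(merged, key=lambda t: (t.position, t.start, t.end))


def position_increments(tokens: List[Token]) -> List[int]:
    """Increment of each token relative to the previous one (0 = same position)."""
    increments = []
    previous = -1  # so that the first token at position 0 gets increment 1
    for tok in tokens:
        increments.append(tok.position - previous)
        previous = tok.position
    return increments


# The two segmentations of "ABC" mentioned in the message.
stream = merge_segmentations([["AB", "C"], ["A", "BC"]])
for tok, inc in zip(stream, position_increments(stream)):
    print(tok.term, (tok.start, tok.end), "position", tok.position, "increment", inc)
```

For AB, A and C this reproduces the offsets and positions given in the message; the entry for BC is simply what the same rule yields for the second segmentation.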

Re: Chinese Segmentation with Phase Query

2007-11-09 Thread Cedric Ho
The CJKAnalyzer is too simple for our needs. But thanks for the suggestion anyway. Cheers, Cedric On Nov 9, 2007 10:43 PM, Open Study [EMAIL PROTECTED] wrote: Hi Cedric, You may try the CJKAnalyzer in the Lucene sandbox. It doesn't give a perfect solution for Chinese word segmentation, but

Re: Chinese Segmentation with Phase Query

2007-11-09 Thread Cedric Ho
On Nov 10, 2007 2:08 AM, Steven A Rowe [EMAIL PROTECTED] wrote: Hi Cedric, On 11/08/2007, Cedric Ho wrote: a sentence containing the characters ABC may be segmented into AB, C or A, BC. [snip] In this case we would like to index both segmentations into the index: AB offset (0,1)

Chinese Segmentation with Phase Query

2007-11-08 Thread Cedric Ho
Hi, We are having an issue while indexing Chinese documents in Lucene. Some background first: Since CJK languages don't have spaces between words, we first have to determine the words from the sentences, e.g. a sentence containing the characters ABC may be segmented into AB, C or A, BC. the