Hi Javier,

It seems that ZWSP has been a format character in the Unicode standard since version 4.0.1 (http://blogs.msdn.com/michkap/archive/2005/03/12/394716.aspx), because ZWSP is not a general spacing character for all languages that lack explicit word boundaries. If it is needed for Khmer, it would be better to modify word breaking only for Khmer in the i18npool module in the future (once OOo differentiates between word breaks for typesetting and word boundary detection for spell checking).
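To illustrate the classification, here is a small sketch that queries the Unicode database shipped with Python's standard library. It is not ICU or OpenOffice.org code: split_on_zwsp is a hypothetical helper, and the sample string is only an example. The reported category depends on the Unicode version bundled with the interpreter; any Python 3 release should report Cf (format) for ZWSP.

```python
import unicodedata

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def describe(ch):
    """Return the Unicode name and general category of one character."""
    return unicodedata.name(ch), unicodedata.category(ch)


def split_on_zwsp(text):
    """Hypothetical helper: treat ZWSP as an explicit word boundary."""
    return [word for word in text.split(ZWSP) if word]


# Category "Cf" means "format", not "Zs" (space separator).
print(describe(ZWSP))

# Because ZWSP is not whitespace, generic whitespace splitting ignores it.
print(ZWSP.isspace())

# Splitting explicitly on ZWSP recovers the marked words.
print(split_on_zwsp("ភាសា\u200bខ្មែរ"))
```

A generic tokenizer that relies on whitespace classes will therefore see the ZWSP-separated text as a single token, which is why the boundary has to be handled explicitly, either in the break rules or in language-specific code.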
Thai word breaking in OpenOffice.org uses only ICU. There was a plan to add better Thai word breaking to ICU (http://osdir.com/ml/lib.icu.general/2003-01/msg00023.html) and Khmer word breaking to OpenOffice.org (http://sourceforge.net/projects/khspell/), but I believe the current Thai word breaking method in ICU uses IBM's original dictionary-based algorithm. It is better to ask a breakiterator/ICU OpenOffice.org developer, Karl Hong or Eike Rathke, about the requirements for Khmer word breaking support in ICU and OpenOffice.org.

The LGPL-licensed OpenOffice.org contains Japanese and Chinese dictionaries, while ICU (released under the MIT license) contains a Thai dictionary, so it is a license problem, too.

Regards,
László

2008/6/21 Javier SOLA <[EMAIL PROTECTED]>:
> Hi László,
>
> I found out why ZWSP does not work as a word boundary in ICU. They decided
> that it was not a spacing character, but a format character, and not a word
> boundary. I have submitted a patch to the ICU for OOo in which we revert it
> to be a spacing character.
>
> We are developing a dictionary-based breakiterator like the one for Thai in
> ICU. The question is: does OpenOffice have special code for tokenization in
> Thai, or is the ICU breakiterator enough?
>
> I want to know whether, besides writing the ICU breakiterator, I also have
> to do something in OOo.
>
> Thanks,
>
> Javier

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
