Re: [lingu-dev] Unicode in MySpell

nemeth Fri, 23 Dec 2005 00:27:50 -0800

Hi Artavazd,

You can use your patch for Armenian OOo 2.0, but Hunspell (essentially an
extended MySpell) is a general solution for encoding problems.


Hunspell integration is targeted at OOo 2.0.2 (end of February 2006),
but you can integrate Hunspell into the OOo 2.0 source from the CWS `hunspell':

$ cvs -d:pserver:[EMAIL PROTECTED]:/cvs login
password: anoncvs
$ cvs -z3 -d:pserver:[EMAIL PROTECTED]:/cvs co -r \
    cws_src680_hunspell lingucomponent config_office scp2

Quoting Artavazd Mertarjyan <[EMAIL PROTECTED]>:

> Hi all!
>
> I'm working on OpenOffice.org 2.0 localization in Armenian.
> We at the Open Source Armenia team already localized version 1.1.0 on
> 01.09.2003. At that time there was no solution for a Unicode spell checker.
> The issue is that there's no 8-bit encoding for Armenian, and only Windows
> XP supports Armenian language in Unicode.
>
> To resolve this encoding problem, I've created a "pseudo" 8-bit encoding.
> The overall algorithm is similar to Hunspell's, but in my case the MySpell
> class performs these steps:

Aspell has a similar solution. It is good enough for most languages, but
some languages have more than 255 characters.
Professional or scientific spell checking also needs to combine characters
from multiple encodings (for example, in foreign geographical names, personal
names and other proper names). This is difficult for agglutinative languages,
because they can combine the different characters in one word (foreign
stem + native affixes): in Hungarian, "about Ångström" is "Ångströmről",
a word with both a Latin-1 (Å) and a Latin-2 (ő) character.
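
The mixed-encoding problem can be checked with a minimal Python sketch (not
part of Hunspell or OOo; the helper name `encodable` is made up for this
illustration). It tests whether the Hungarian example word can be stored in
either single-byte charset or only in Unicode:

```python
# Check which charsets can represent a word that mixes a foreign stem
# with a native Hungarian suffix.
WORD = "Ångströmről"

def encodable(word: str, charset: str) -> bool:
    """Return True if every character of `word` exists in `charset`."""
    try:
        word.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

# Latin-1 lacks "ő", Latin-2 lacks "Å"; only a Unicode encoding holds both.
results = {cs: encodable(WORD, cs) for cs in ("iso-8859-1", "iso-8859-2", "utf-8")}
print(results)
```

Neither 8-bit charset can hold the whole word, so a dictionary stored in one
8-bit encoding cannot contain it.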

Hunspell really handles 16-bit characters. The Nepali and Hungarian versions
of OOo 2.0 use Hunspell with Unicode (UTF-8) Nepali and Hungarian dictionaries.

> 1. Encoding detection.
> 2. If Armenian, convert UTF-8 text to ARMSCII-8 (pseudo 8-bit encoding).
> 3. If incorrect, make suggestion list from dictionary (8-bit encoding).
> 4. Return suggestion list converted from ARMSCII-8 to UTF-8.
> So I've added
> 1. ARMSCII-8 to UTF-8 converter
> 2. UTF-8 to ARMSCII-8 converter
> 3. A different "special_chars" in "cleanword" method.
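
The four quoted steps can be sketched roughly as follows. This is an
illustration only, not the code from the patch: the byte values in TOY_TABLE
are placeholders rather than real ARMSCII-8 code points, the dictionary words
are arbitrary letter strings, and all function names are hypothetical.

```python
from typing import List, Optional

# Toy mapping only: placeholder bytes, NOT real ARMSCII-8 code points.
TOY_TABLE = {"ա": 0x80, "բ": 0x81, "գ": 0x82, "դ": 0x83}
REVERSE = {byte: ch for ch, byte in TOY_TABLE.items()}

def is_armenian(text: str) -> bool:
    """Step 1 (crude detection): every character is in the toy table."""
    return all(ch in TOY_TABLE for ch in text)

def to_pseudo8(text: str) -> bytes:
    """Step 2: map each Unicode character to one byte."""
    return bytes(TOY_TABLE[ch] for ch in text)

def from_pseudo8(data: bytes) -> str:
    """Step 4: map each byte back to its Unicode character."""
    return "".join(REVERSE[b] for b in data)

# The dictionary is stored in the 8-bit form, as in the quoted approach.
DICTIONARY = {to_pseudo8("աբգ"), to_pseudo8("բադ")}

def suggest(word8: bytes) -> List[bytes]:
    """Step 3 (naive): same-length dictionary words differing in one byte."""
    return sorted(
        w for w in DICTIONARY
        if len(w) == len(word8) and sum(a != b for a, b in zip(w, word8)) == 1
    )

def check(text: str) -> Optional[List[str]]:
    """Return None if not handled, [] if correct, else suggestions."""
    if not is_armenian(text):
        return None
    word8 = to_pseudo8(text)
    if word8 in DICTIONARY:
        return []
    return [from_pseudo8(s) for s in suggest(word8)]

print(check("աբգ"), check("աբդ"), check("abc"))
```

The sketch also shows the limit of the approach: the table has at most 256
slots, so any character outside it is not handled at all.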

The special_chars handling in clean_word() has been deleted from the MySpell
source. The right tokenization now comes from OOo's BreakIterator.
If the default tokenization is bad for Armenian, you need a BreakIterator
patch (see i18npool/source/breakiterator/ and its data/ subdirectory).

Best regards,

Laci

>
> Thus, I would like to ask you to consider this option, and would very much
> like to get feedback from the OpenOffice community on it.
>
> You can see the sources at:
> http://hy.openoffice.org/source/browse/hy/src/2.0.0/lingucomponent/source/spellcheck/myspell/
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



