Re: [lingu-dev] Unicode normalisation

Németh László Mon, 05 Jan 2009 07:11:00 -0800

Hi,

This is really not just a spell checking problem. OpenOffice.org has
problems with both the visual and the functional equivalence of Unicode
characters.  For example, here is the result of the Find All ä
operation on ÄÄää, i.e. on the "A U+0308 (COMBINING DIAERESIS) Ä a
U+0308 ä" character sequence:
http://www.flickr.com/photos/85171...@n00/3170574450/
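As a rough illustration of the mismatch (a Python sketch using the standard unicodedata module, not OOo code; the variable names are only for this example), a plain code point search finds one ä in that sequence, while a search after NFC normalization finds both, and a case-insensitive one finds all four:

```python
import unicodedata

# "A U+0308, precomposed Ä, a U+0308, precomposed ä", as in the example above.
text = "A\u0308\u00c4a\u0308\u00e4"
needle = "\u00e4"  # precomposed ä

# Plain code point comparison: only the precomposed ä matches.
naive_hits = text.count(needle)

# After NFC normalization both lowercase forms are the same code point.
nfc_text = unicodedata.normalize("NFC", text)
nfc_hits = nfc_text.count(needle)

# Case-insensitive search on the normalized text matches all four characters.
folded_hits = nfc_text.lower().count(needle)

print(naive_hits, nfc_hits, folded_hits)
```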

It would be good to solve this problem in future OpenOffice.org
versions through automatic Unicode normalization and OpenType support.
Hunspell 1.2.x (which I hope will be in OOo 3.1) has a temporary
solution for Unicode normalization (canonical and compatibility): the
optional input/output conversion:

ICONV 4
ICONV Ä Ä
ICONV ä ä
ICONV 가 ᄀ ᅡ
ICONV ﬁ fi

The first three conversions are canonical normalizations: two
compositions and a Hangul decomposition. The conversion of the ﬁ
ligature is a compatibility normalization (but spell checking words
with f-ligatures also needs fixed word breaking in OOo).
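The difference between the two kinds of equivalence can be checked with Python's unicodedata module. This is only a sketch: the function name equivalence_kind is made up for this example, and the code points are written as escapes on the assumption that the first two ICONV entries map the decomposed sequence to the precomposed letter.

```python
import unicodedata

def equivalence_kind(source, target):
    """Classify a conversion pair as 'canonical', 'compatibility' or 'other'."""
    nfd = lambda s: unicodedata.normalize("NFD", s)
    nfkd = lambda s: unicodedata.normalize("NFKD", s)
    if nfd(source) == nfd(target):
        return "canonical"        # same text under canonical decomposition
    if nfkd(source) == nfkd(target):
        return "compatibility"    # equal only after compatibility mapping
    return "other"

pairs = [
    ("A\u0308", "\u00c4"),        # A + combining diaeresis -> Ä
    ("a\u0308", "\u00e4"),        # a + combining diaeresis -> ä
    ("\uac00", "\u1100\u1161"),   # Hangul syllable -> conjoining jamo
    ("\ufb01", "fi"),             # fi ligature -> f + i
]
kinds = [equivalence_kind(src, dst) for src, dst in pairs]
print(kinds)
```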

Conversion of the spell checking suggestions to the original composed form:

OCONV 2
OCONV ᄀ ᅡ 가
OCONV fi ﬁ
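To show what the two tables do to a word, here is a small Python sketch. It is not Hunspell's implementation: apply_conv is a hypothetical helper, and trying the longest pattern first at each position is an assumption about how overlapping entries should be resolved.

```python
# Conversion tables corresponding to the ICONV and OCONV examples above.
ICONV = {
    "A\u0308": "\u00c4",          # compose Ä
    "a\u0308": "\u00e4",          # compose ä
    "\uac00": "\u1100\u1161",     # decompose the Hangul syllable into jamo
    "\ufb01": "fi",               # replace the fi ligature
}
OCONV = {
    "\u1100\u1161": "\uac00",     # recompose the Hangul syllable
    "fi": "\ufb01",               # restore the ligature in suggestions
}

def apply_conv(table, text):
    """Replace table patterns in text, scanning left to right."""
    patterns = sorted(table, key=len, reverse=True)  # longest pattern first
    out = []
    i = 0
    while i < len(text):
        for pat in patterns:
            if text.startswith(pat, i):
                out.append(table[pat])
                i += len(pat)
                break
        else:
            out.append(text[i])   # no pattern starts here, copy the character
            i += 1
    return "".join(out)

checked = apply_conv(ICONV, "\ufb01nden")   # form passed to the dictionary
suggested = apply_conv(OCONV, checked)      # form shown to the user
print(checked, suggested)
```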

(Special spell checking requirements need special solutions. For
example, German typography uses f-ligatures only within words, but not
at compound word boundaries, so the previous OCONV fi ﬁ conversion is
not right for German. A redundant dictionary with non-suggested
decomposed forms, plus dictionary words with ligatures, helps to check
the correct typography of a German text:

--- affix file ---
NOSUGGEST *
REP 2
REP fi ﬁ
REP ﬁ fi

--- dictionary file ----
finden/*
ﬁnden
)

Hyphenation of both composed and decomposed characters is possible
in OOo by using redundant hyphenation patterns.
Compatibility equivalent ligatures can be handled by non-standard
hyphenation (alternations):

ﬁ1/f=i,1,1

For thesauri, a temporary solution is to use redundant items or references:

ﬁnden->finden

The upcoming Hunspell-based stemming in the OOo thesaurus can also
handle the normalization problem temporarily.
ICONV input conversion or explicit stems (
--- dic file ---
ﬁnden st:finden
) can give the normalized stems to the thesaurus component.

Maybe a new Hunspell tool could help spelling dictionary developers
by automatically generating the ICONV normalization table.
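A minimal sketch of what such a generator might look like, in Python with the standard unicodedata module. No such Hunspell tool exists as far as this message says; the function iconv_table and its parameters are hypothetical. It assumes the dictionary stores precomposed (NFC) letters, so decomposed input is mapped to them, and it folds an explicitly given list of compatibility characters with NFKC.

```python
import unicodedata

def iconv_table(dictionary_chars, compat_chars=()):
    """Build ICONV lines for the characters a dictionary uses."""
    rules = []
    # Canonical part: map the decomposed sequence to the dictionary's letter.
    for ch in sorted(set(dictionary_chars)):
        nfd = unicodedata.normalize("NFD", ch)
        if nfd != ch:
            rules.append((nfd, ch))
    # Compatibility part: map e.g. ligatures to their plain equivalents.
    for ch in sorted(set(compat_chars)):
        nfkc = unicodedata.normalize("NFKC", ch)
        if nfkc != ch:
            rules.append((ch, nfkc))
    lines = ["ICONV %d" % len(rules)]
    lines += ["ICONV %s %s" % pair for pair in rules]
    return "\n".join(lines)

table = iconv_table("\u00c4\u00e4b", compat_chars="\ufb01")
print(table)
```

A dictionary that stores decomposed forms (as in the Hangul example) would need the pairs reversed, and the matching OCONV table would have to be chosen per language, as the German case shows.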

Regards,
László

2009/1/5 Stephan Bergmann <[email protected]>:
> On 01/02/09 09:51, F Wolff wrote:
>>
>> Hallo all
>>
>> We recently had a discussion on a list for African localisation about
>> the utility of having Unicode normalisation automatically done in
>> Hunspell, so that creators of spell checkers wouldn't need to worry
>> about that.
>>
>> Is this a feature that would be useful to more people? Is there
>> something generic in OOo that handles normalisation issues for other
>> purposes? (searching, thesaurus, indexes, etc.)  I can think of many
>> places where it could be relevant.
>>
>> I'm curious to hear what other people think.
>
> I brought this up years ago as point 4 of
> <http://www.openoffice.org/servlets/ReadMsg?list=dev&msgNo=7099>, but
> nothing became of it back then...
>
> -Stephan
