[lingu-dev] new functionality

Chantal ENGUEHARD Wed, 27 Apr 2005 12:11:20 -0700

Hello,

I am a researcher in the field of Natural Language Processing, working
especially on African languages.


I'd like to modify the spell checker so that, when it detects an error, it
presents some additional information with each possible correction. (This idea
is developed in the text below, which is part of an article presented at the
last International Unicode Conference in Berlin.)

I understand that the spell checker of OpenOffice is based on Ispell and that
I should modify Ispell to provide this new functionality: presenting additional
information with the possible corrections. I have also found some
documentation about the structure of the dictionaries.

I am contacting you to propose my project and to get help on what to do in
order to reach my goal.

Thanks in advance

Chantal Enguehard

-------------------------------------------------------------------
3.      processing, an asset for the African languages

3.1.    Assumptions

It is imperative to develop low-cost strategies (in human and financial terms)
to make possible the constitution and diffusion of linguistic resources.
We assume that people writing texts in a national language (school books,
technical books, newspapers, Internet sites) could maintain a good quality of
language and produce texts in greater quantity if they had access to the
linguistic resources they need.
Now, any text, at one time or another in its development, is composed and
stored in an electronic medium. We assume that this systematic use of
computers is an opportunity to seize: computers are a place to diffuse and
collect linguistic data, and they offer modern and elaborate tools that can
accelerate the slow work of language transcription.

3.2.    Facilitating the dissemination of linguistic information

Linguistic knowledge can be presented in the form of hypertexts downloadable
from the sites of the institutions, but producing hypertexts represents too
large an amount of work to be feasible in the short term. It seems more
judicious to create spelling correctors within text editors, because such
correctors can be produced with little human work, on the basis of text
corpora.
In addition, even if African languages are currently not well represented
on the Web [Diki-Kidiri 2003], solving the problem of representing and
displaying special characters, together with the development of adapted
editors, should allow the creation of Internet sites written in African
languages for the speakers of these languages.

(...)

4.      Objectives

We focus on the development of spelling correctors produced from text
corpora, and on tools allowing linguists to easily examine text corpora in
order to enrich the linguistic resources.

4.1.    Spelling correctors

Short state of the art

Spelling correctors have been a research topic since the 1960s
[Kukich 1992]. They are now commonly used by the general public because
current text editors often integrate one, and they bring considerable comfort
during the drafting of texts. These correctors work in an interactive mode in
which the user intervenes, unlike the automatic spelling correctors used, for
example, in the field of optical character recognition (with which we are not
concerned here).

An interactive spelling corrector works in several stages:
1.      detection of the errors;
2.      selection of the possible corrections;
3.      ordering of the possible corrections and proposal to the user;
4.      effective correction of the text respecting the choice of the user.

1. The detection of errors is carried out by considering the words of the
text one by one, in isolation. (...)

2. When an error is detected, the corrector selects a series of words likely
to be the correct version of the string to be corrected. These words are
selected according to various techniques (calculation of the minimum edit
distance, of a similarity key, or measurement of the phonological distance).

3. The ordering of the possible corrections takes into account the measurement
used during the previous selection stage, as well as statistical measurements
(such as the frequency of occurrence of the words, or the word most frequently
selected when the same error was previously encountered).
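
Stages 1 to 3 above can be illustrated with a minimal Python sketch. It is
not Ispell's algorithm and not the corrector described in the article: it
assumes a plain Levenshtein edit distance for the selection stage and orders
candidates by distance, then by corpus frequency. The names edit_distance,
suggest and LEXICON, the toy English word list and its frequencies are all
hypothetical.

```python
from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimal number of single-character
    insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggest(word: str, lexicon: Dict[str, int],
            max_distance: int = 2) -> List[str]:
    """Return candidate corrections, closest first, then most frequent."""
    if word in lexicon:                        # stage 1: no error detected
        return []
    scored = []
    for entry, frequency in lexicon.items():   # stage 2: selection
        distance = edit_distance(word, entry)
        if distance <= max_distance:
            scored.append((distance, -frequency, entry))
    scored.sort()                              # stage 3: ordering
    return [entry for _, _, entry in scored]


# Hypothetical toy lexicon (assumption): word -> frequency in a corpus.
LEXICON = {"house": 120, "horse": 40, "mouse": 15, "hose": 5}

print(suggest("houze", LEXICON))  # ['house', 'horse', 'mouse', 'hose']
```

A real corrector would avoid scanning the whole lexicon for every error, and
could replace or combine the distance with a similarity key or a phonological
measure, as mentioned above.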

Lastly, an interactive stage allows the user to supervise the correction. The
user can take one of the following three actions:
-       correct the erroneous word by selecting one of the proposed
corrections;
-       modify the erroneous word;
-       leave the word uncorrected; in this last case, the user can add this
word to a personal dictionary.

Specific needs

There are already spelling correctors for some African languages (Microsoft
announced in 2004 the marketing of a spelling corrector for Word in
Kiswahili), but they are generally very simple: they use existing spelling
correctors, providing a lexicon corresponding to the targeted language [Van
der Veken 2003]. (...)
In addition, the corrector must offer additional functionalities compared to
the usual spelling correctors in order to take into account the linguistic
context of the African languages. On the one hand, it can take part in the
dissemination of linguistic information by accompanying the proposed
corrections with additional linguistic information; on the other hand, it can
contribute to the constitution of linguistic resources by collecting data
intended for linguists.
Lastly, a spelling corrector encounters, by definition, many texts, and is
provided with a lexicon which enables it to identify the words absent from
that lexicon. We mentioned that a user can add words that are correct, but
absent from the general lexicon, to a personal lexicon. We wish to exploit
this process of enrichment in order to help the institutions responsible for
the languages to extend the official lexicon of a language. When a word is
added to the personal lexicon, the spelling corrector can record this word in
a file, together with the sentence in which it appears. The user is strongly
encouraged to transmit this file to the institution in charge of work on the
language. The information contained in this type of file can be used to enrich
the lexicon of the language (see "support for the production of linguistic
resources").
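
As an illustration of this collection step, here is a minimal Python sketch.
It assumes a file format the article does not specify (one JSON record per
line, holding the word and its sentence); the function names, the file name
and the placeholder word "wordx" are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Set


def add_to_personal_lexicon(word: str, sentence: str,
                            personal: Set[str], log_path: Path) -> None:
    """Accept a word into the personal lexicon and record it, with the
    sentence in which it appeared, for later transmission to linguists."""
    personal.add(word)
    record = {"word": word, "sentence": sentence}
    # Append mode keeps earlier contributions; UTF-8 preserves the
    # special characters used in the language.
    with log_path.open("a", encoding="utf-8") as log:
        log.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_contributions(log_path: Path) -> List[Dict[str, str]]:
    """Read back the collected records (what an institution would receive)."""
    with log_path.open(encoding="utf-8") as log:
        return [json.loads(line) for line in log if line.strip()]


# Usage with a temporary file; "wordx" is a placeholder, not a real word.
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "contributions.jsonl"
    personal_lexicon: Set[str] = set()
    add_to_personal_lexicon("wordx", "A sentence containing wordx.",
                            personal_lexicon, path)
    print(read_contributions(path))
```

Storing the sentence next to the word gives the linguist the context needed
to decide whether the word belongs in the official lexicon.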
This spelling corrector is, to a certain extent, independent of the language,
since the language-dependent processing (such as the calculation of the
inflections and derivations of words) is described in generic modules which
use information gathered in the electronic lexicon of the language. Each item
is, as far as possible, accompanied by its grammatical category, its mode of
inflection and its possible derivations, so that the lexicon can be extended
to all the forms. Contextual information is also integrated into the lexicon
in the form of probabilities. To adapt this spelling corrector to a new
language, it is thus enough to change the lexical resources.
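
The separation between a generic module and language-specific lexical data
could look like the following Python sketch. The data layout is an
assumption, not the format used by the article or by Ispell: each entry names
an inflection mode, and each mode is a list of (suffix to strip, suffix to
add) rules. Entry, PARADIGMS, expand and build_full_lexicon are hypothetical
names, and the English examples are only placeholders.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple

Rule = Tuple[str, str]  # (suffix to strip, suffix to add)


@dataclass(frozen=True)
class Entry:
    lemma: str
    category: str   # grammatical category, e.g. "noun"
    paradigm: str   # name of the inflection mode


# Language-specific data (assumption): changing this table and the entry
# list is all that is needed to target another language.
PARADIGMS: Dict[str, List[Rule]] = {
    "noun-s": [("", ""), ("", "s")],
    "noun-y": [("", ""), ("y", "ies")],
}


def expand(entry: Entry, paradigms: Dict[str, List[Rule]]) -> List[str]:
    """Generic module: apply every rule of the entry's inflection mode."""
    forms = []
    for strip, add in paradigms[entry.paradigm]:
        if strip and not entry.lemma.endswith(strip):
            continue  # rule does not apply to this lemma
        stem = entry.lemma[: len(entry.lemma) - len(strip)]
        forms.append(stem + add)
    return forms


def build_full_lexicon(entries: Iterable[Entry],
                       paradigms: Dict[str, List[Rule]]) -> Set[str]:
    """Extend the lexicon to all the forms derivable from its entries."""
    return {form for entry in entries for form in expand(entry, paradigms)}


ENTRIES = [Entry("book", "noun", "noun-s"), Entry("city", "noun", "noun-y")]
print(sorted(build_full_lexicon(ENTRIES, PARADIGMS)))
# ['book', 'books', 'cities', 'city']
```

The contextual probabilities mentioned above are not modelled here; they
would be additional data attached to the entries.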




Chantal ENGUEHARD
LINA
2, rue de la Houssinière
BP 92208
44322 Nantes Cedex 03
France

