Hello, I am a researcher in the field of Natural Language Processing, working in particular on African languages.
I would like to modify the spell checker so that, when it detects an error, it presents some additional information with each possible correction. (This idea is developed in the text below, which is part of an article presented at the last International Unicode Conference in Berlin.)

I understand that the OpenOffice spell checker is based on Ispell, and that I would need to modify Ispell to provide this new functionality: presenting additional information with the possible corrections. I have also found some documentation about the structure of the dictionaries.

I am contacting you to propose this project and to ask for help on how to reach my goal.

Thanks in advance,
Chantal Enguehard

-------------------------------------------------------------------

3. processing, an asset for the African languages

3.1. Assumptions

It is imperative to develop low-cost strategies (in both human and financial terms) to make the constitution and diffusion of linguistic resources possible.

We assume that people writing texts in a national language (school books, technical books, newspapers, Internet sites) could maintain a good quality of language and produce more texts if they had access to the linguistic resources they need.

However, any text, at one time or another in its development, is composed and stored on an electronic medium. We assume that this systematic use of computers is an opportunity to seize, because computers are a place to diffuse and collect linguistic data, and they offer modern and elaborate tools that can accelerate the slow work of language transcription.

3.2. Facilitating the dissemination of linguistic information

Linguistic knowledge can be presented in the form of hypertexts downloadable from the sites of the institutions, but producing hypertexts represents too large an amount of work to be feasible in the short term.
It seems more judicious to create spelling correctors within text editors, because such correctors can be produced with little human work, based on corpora of texts. In addition, even if African languages are currently not well represented on the Web [Diki-Kidiri 2003], solving the representation and display of special characters, together with the development of adapted editors, should allow the creation of Internet sites written in African languages for the speakers of these languages.

(...)

4. Objectives

We focus on the development of spelling correctors produced from corpora of texts, and of tools allowing linguists to easily analyse corpora of texts in order to enrich the linguistic resources.

4.1. Spelling correctors

Short state of the art

Spelling correctors have been a research topic since the 1960s [Kukich 1992]. They are now commonly used by the general public, because current text editors often integrate one, and they bring considerable comfort when drafting texts. These correctors work in an interactive mode in which the user intervenes, unlike automatic spelling correctors such as those used in optical character recognition (which do not concern us here).

An interactive spelling corrector works in several stages:

1. detection of the errors;
2. selection of the possible corrections;
3. ordering of the possible corrections and proposal to the user;
4. actual correction of the text, respecting the user's choice.

1. Errors are detected by considering the words of the text to correct one by one, in isolation. (...)

2. When an error is detected, the corrector selects a series of words likely to be the correct version of the string to be corrected. These words are selected using various techniques (computation of the minimal edit distance, of a similarity key, or measurement of the phonological distance).

3. The ordering of the possible corrections takes into account the measure used during the previous selection stage, as well as statistical measures (such as the frequency of the words, or the word most frequently selected when the same error was previously encountered).

Lastly, an interactive stage allows the user to supervise the correction. The user can take one of three actions:

- correct the erroneous word by selecting one of the proposed corrections;
- modify the erroneous word;
- leave it uncorrected; in this last case, the user can add the word to a personal dictionary.

Specific needs

There are already spelling correctors for some African languages (Microsoft announced in 2004 the marketing of a spelling corrector for Word in Kiswahili), but they are generally very simple: they use existing spelling correctors, providing a lexicon corresponding to the targeted language [Van der Veken 2003].

(...)

In addition, it must offer functionalities beyond those of the usual spelling correctors in order to take into account the linguistic context of African languages. On the one hand, it can take part in the dissemination of linguistic information by accompanying the proposed corrections with additional linguistic information; on the other hand, it can contribute to the constitution of linguistic resources by collecting data intended for linguists.

Lastly, a spelling corrector encounters, by definition, many texts, and is provided with a lexicon which enables it to identify the words absent from that lexicon. We mentioned that a user can add words that are correct, but absent from the general lexicon, to a personal lexicon. We wish to exploit this enrichment process in order to help the institutions responsible for the languages extend the official lexicon of a language. When a word is added to the personal lexicon, the spelling corrector can record this word in a file, together with the sentence in which it appears.
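
To make the stages above more concrete, here is a minimal sketch in Python. It is only an illustration and has nothing to do with Ispell's actual code or dictionary format. It selects candidates by minimal edit distance, orders them by distance and then by frequency, attaches additional linguistic information to each proposal, and records a word added to the personal lexicon together with its sentence. All names (Entry, suggest, log_added_word), the JSON-lines file layout, the distance threshold and the frequency values are assumptions made for this example.

```python
import json
from dataclasses import dataclass


@dataclass
class Entry:
    """One lexicon item. The fields are hypothetical, not Ispell's format."""
    word: str
    frequency: int  # invented corpus frequency, used only to order proposals
    info: str       # additional linguistic information shown to the user


def edit_distance(a: str, b: str) -> int:
    """Minimal edit (Levenshtein) distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def suggest(word, lexicon, max_distance=2):
    """Stages 2 and 3: select close entries, nearest and most frequent first."""
    scored = [(edit_distance(word, entry.word), entry) for entry in lexicon]
    close = [(d, entry) for d, entry in scored if d <= max_distance]
    close.sort(key=lambda pair: (pair[0], -pair[1].frequency))
    return [entry for _, entry in close]


def log_added_word(word, sentence, path):
    """Append a word added to the personal lexicon, with its sentence."""
    record = {"word": word, "sentence": sentence}
    with open(path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record, ensure_ascii=False) + "\n")


# Tiny sample lexicon; the frequencies are made up for the example.
LEXICON = [
    Entry("kitabu", 120, "noun, class 7, 'book'; plural: vitabu"),
    Entry("vitabu", 80, "noun, class 8, 'books'; singular: kitabu"),
    Entry("kiatu", 40, "noun, class 7, 'shoe'; plural: viatu"),
]

if __name__ == "__main__":
    # Each proposal is displayed with its additional information.
    for entry in suggest("kitbu", LEXICON):
        print(f"{entry.word}: {entry.info}")
```

A call such as log_added_word(word, sentence, "added_words.jsonl") would be made whenever the user chooses to add a word to the personal lexicon; the file name is again only an example.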
The user is strongly encouraged to transmit this file to the institution in charge of work on the language. The information contained in this type of file can be used to enrich the lexicon of the language (see "support for the production of linguistic resources").

This spelling corrector is, to a certain extent, independent of the language, since the language-dependent processing (such as the computation of the inflections and derivations of words) is described in generic modules which use information gathered in the electronic lexicon of the language. Each item is, as far as possible, accompanied by its grammatical category, its mode of inflection and its possible derivations, so that the lexicon can be extended to all the forms. Contextual information is also integrated into the lexicon in the form of probabilities. To adapt this spelling corrector to a new language, it is thus enough to change the lexical resources.

Chantal ENGUEHARD
LINA
2, rue de la Houssinière
BP 92208
44322 Nantes Cedex 03
France
