Dear Kevin, dear Apertiumers:
as Francis said, thanks for such a long and detailed answer. It took me
a while to read it, and I don't think I'll comment on everything you
said, especially since there are things I don't really understand well
and should ask about. But here are my first thoughts after a first read.
*A bit of history*: You know how the Apertium dictionary format was
created. It is, basically, the XML version of the old dictionaries we
used in interNOSTRUM.com and traductor.universia.net to translate
between Spanish and Catalan and Spanish and Portuguese, an ad-hoc format
that was invented to build an MT system in one year back in 1999. The
ideas that shaped the design of those dictionaries (and Apertium 1)
are (a) 12 years old and (b) strongly influenced by how we did things
back then for very similar languages. I was only starting to do machine
translation then. Analogously, t1x is just a rewriting and updating of
the "morphtrans" language used there for transfer.
The result is a format, which we now call "dix", that happened to be
much more powerful than what was needed for those inter-Romance
translation tasks, but not powerful enough to deal with other (for
instance, non-catenative) morphologies.
*A [limited] success*: But the main result, despite the difficulties,
is a success: many language pairs have been built for Apertium. People
learn the format, and (painstakingly) write dictionaries, even if, as
Fran says, XML is not meant for humans to deal with on a day-to-day
basis. Maintenance is difficult, dictionaries are different from one
language to another, as "dix" is (ironically) too powerful a formalism.
Its limited power to deal with other languages has been stretched to the
limit to build dictionaries (some of which are, granted, unreadable and
hard to maintain). Metadix formats have been designed to deal with
stem-vowel alternations, accents, and whatnot. This is proof that we do
need a general, more abstract format from which dixes could be created.
We just haven't found a good way to do it.
*A 12-year curse*: Good old interNOSTRUM had its GUI to introduce new
words. And early in Apertium, Fran also set up a webpage to add entries.
They used "some selected inflected forms" to help the user decide on the
paradigm. But they were not general enough. GUIs have been tried, as
well as a database format. I don't have to tell you they didn't succeed:
we would have a tool if they had! Currently a student, under the
supervision of Juan Antonio and me, is trying to revive the idea of
"easy dictionary maintenance" that didn't work either as a GSoC project
I mentored in 2010. The idea is to separate those parts of the
dictionaries that can be dealt with using something like a "grid"
(spreadsheet, database, you name it) and to make it possible to add
"regular", single-word entries to dictionaries (with a config file that
would be unique to each language pair and be distributed with it).
We'll see how it goes. It's a 12-year-old curse we have and we are
trying hard to shake it off.
*Full forms and morpheme boundaries*: As Jim said, there is no
impediment to adding full-form entries (such as the ones used by
Freeling). Also, the fact that current dictionaries do not segment
words at morpheme boundaries is just because people who chose to write
them didn't think that way. Also, I suspect that the idea of "morpheme
boundary" may not be a trivial one. I know because initially I insisted
(in interNOSTRUM) that they should try their best to have
linguistically-inspired segmentations, and I remember that it was not
trivial at all, even for Spanish and Catalan.
*Paradigms and models*: Maybe my thinking is spoiled by the fact that I
have always thought of monolingual dictionaries as morphological
dictionaries, and therefore, I don't see an easy way to do away with
some reference to paradigms. Also, abstraction was probably not the
main goal of the original design, but rather feasibility (a translator
in one year).
I understand that for most words, it would probably be enough to say how
they inflect, and in fact, many entries like
<e lm="liberación"><i>liberaci</i><par n="acci/ón__n"/></e>
basically say so, and could be more abstractly entered as:
<e>liberación<par n="acci/ón__n"/></e>
which just specifies that the word "liberación" inflects as "acción"...
oh well... it does specify other things such as the fact that liberación
is a (feminine) noun as "acción" is (this info comes inside the
paradigm). And yes, for a big part of the dictionary, entries like
these could work, and be easily converted to the current dix format for
compiling. This could be a nice addition to a metadix format. And they
could separate morphology too.
We could also do away with XML and have something like
liberación as acción
liberalidad as accesibilidad
lágrima as abeja
and some intelligent way to process what to do to "liberación" to create
all its full forms and their lexical forms. In fact, one could have
"liberación as acción" and later on "retroferbulación as liberación",
and the compiler could resolve the indirect reference to generate
<e lm="retroferbulación"><i>retroferbulaci</i><par n="acci/ón__n"/></e>
But to be able to do that, a certain "paradigm" apparatus should already
be present, and also the whole "irregular" part of the language...
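To make the idea concrete, here is a rough sketch (entirely my own, not Apertium code) of how such "word as model" declarations could be compiled into dix entries, including the indirect case. The seed table mapping model lemmas to their paradigms and invariable stems is an invented assumption standing in for the "paradigm apparatus" that would already have to exist:

```python
# Seed: model lemma -> (paradigm name, invariable stem of the model).
# This stands in for the pre-existing paradigm apparatus.
SEED = {"acción": ("acci/ón__n", "acci")}

def compile_entries(declarations):
    """Resolve 'word as model' pairs into dix <e> entries, allowing a
    model to be a word declared earlier (an indirect reference)."""
    models = dict(SEED)
    entries = []
    for word, model in declarations:
        paradigm, model_stem = models[model]
        # the part of the model lemma that the paradigm inflects, e.g. "ón"
        suffix_len = len(model) - len(model_stem)
        stem = word[:-suffix_len]
        entries.append('<e lm="%s"><i>%s</i><par n="%s"/></e>'
                       % (word, stem, paradigm))
        models[word] = (paradigm, stem)  # later entries may say "... as word"
    return entries

for e in compile_entries([("liberación", "acción"),
                          ("retroferbulación", "liberación")]):
    print(e)
```

Under these assumptions, the two declarations produce exactly the two full dix entries shown earlier, with the indirect "retroferbulación as liberación" resolved through the entry for "liberación".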
How could we do the same for bilingual dictionaries?
For those of you with a programming background, "dix" is like
programming in assembler. We need a high-level language, but since dix
compilers are fast and well written, perhaps "dix" may still have a
place as an intermediate format.
*A note on paradigms and speed*: Using paradigms speeds compiling _a
lot_. Sergio Ortiz, who completely redesigned the dictionary compiler
for Apertium, can tell you why.
*On enclitics in the dictionary*: don't forget that Apertium does
"tokenize as you analyse".
Somewhere, if not in the dix, you should have a way to deal with the
way in which Italian, Spanish or Portuguese handle the orthography of
verbs with enclitic pronouns.
/(es) démonoslos → demos + nos + los
(es) bésame → besa + me
(es) mátalo → mata + lo/
One way would be to have some kind of pattern BEFORE the dictionaries to
deal with that:
/-émonoslos → -emos + nos + los
-é[C]ame → -e[C]a + me   # [C] is any valid consonant cluster, defined in advance
-á[C]alo → -a[C]a + lo/
These "suffix" transformations could be dealt with using "dix" regular
expressions and some kind of "suffix processing", not too different from
postgeneration, in fact. Many of these things could be done with minor
changes to the current ways in which we deal with Apertium, with new
modes...
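As a toy illustration (my own sketch, not an existing Apertium module), suffix patterns like those above could be tried with plain regular expressions before dictionary lookup; the pronoun list and the consonant class standing in for [C] are crude assumptions:

```python
import re
import unicodedata

PRONOUNS = 'me|te|se|nos|os|lo|la|los|las'

def deaccent(ch):
    """Strip the accent from a single character: á -> a, é -> e."""
    return ''.join(c for c in unicodedata.normalize('NFD', ch)
                   if unicodedata.category(c) != 'Mn')

def split_enclitics(form):
    """Peel enclitic pronouns off a Spanish verb form when a pattern
    applies; otherwise return the form unchanged."""
    # -émonoslos -> -emos + nos + los   (démonoslos -> demos+nos+los)
    m = re.match(r'^(.+)émonos(%s)$' % PRONOUNS, form)
    if m:
        return m.group(1) + 'emos+nos+' + m.group(2)
    # -é[C]ame -> -e[C]a + me ; -á[C]alo -> -a[C]a + lo
    # [C] approximated here as a run of common consonants
    m = re.match(r'^(.*[áé])([bcdfglmnprst]+)a(%s)$' % PRONOUNS, form)
    if m:
        stem = m.group(1)[:-1] + deaccent(m.group(1)[-1])
        return stem + m.group(2) + 'a+' + m.group(3)
    return form
```

With these two rules, "démonoslos" comes out as "demos+nos+los", "bésame" as "besa+me" and "mátalo" as "mata+lo", while forms that match no pattern pass through untouched.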
I think this is an interesting discussion and that we should have some
kind of "rethinking apertium" project or conference or un-conference.
Many of these limitations, such as the ones Kevin pointed out, are
clearly hindering the expansion of Apertium.
Cheers
Mikel
On 11/12/2011 10:15 PM, Kevin Donnelly wrote:
Hi Mikel
On Saturday 12 November 2011 Mikel Forcada said:
On 11/12/2011 10:31 AM, Kevin Donnelly wrote:
FWIW, I think the fundamental problem is that the format of the
dictionaries is non-optimal from a linguistic point of view.
Kevin, it would be good to hear a bit more detail about how it would be
improved, as at some point we should revive the process towards
unification and standardization of metadix.
Caveat - I am not a CS specialist, and it's a couple of years since I worked
directly on the Apertium format, so my memory may be hazy, or things may have
changed. :-)
Also, sorry for the length ....
Current situation (as I see it)
------------------------------------------
I think there is a bundle of issues which combine to make things non-optimal:
-- the format requires words to be segmented;
-- the segmentation boundaries in the dictionary don't necessarily align with morpheme boundaries;
-- many words are handled indirectly via paradigms.
The result is that expanding the dictionaries is actually quite involved - you
examine the word-list, decide on paradigms and code them up, assign the words
to paradigms, code up those that fit the paradigms, and code individually any
words that do not fit into a paradigm.
However, you can't necessarily use the paradigms you find in grammar-books,
because the format uses orthographic instead of morphemic boundaries, so you
may have to refactor the paradigms first, which is a non-trivial task in my
experience.
There is also the point that most speakers do not think in terms of paradigms
anyway - they just "know" that a particular form "sounds" right - and working
out the finer points of inflected tenses or locatives for rarely-used words is
often not a trivial exercise either.
*Some* of this can be simplified via scripting - but then you need to be able
to script, and in my experience the number of linguists (let alone interested
members of the public) in any given location who can use regular expressions
(never mind scripting!) can be counted on the fingers of half a hand.
I also found updating the dictionaries even more difficult than creating them,
but that's just a personal view based on my loathing of XML, and I accept that
others probably find it simplicity itself. :-)
So, what to do? I would suggest a few things to make dictionary maintenance
an order of magnitude easier:
(a) Remove paradigms from the dictionary.
-------------------------------------------------------------
In effect, you are splitting words artificially (not along linguistically-
accepted lines) on input, so that you can put them back together again at
lookup. It would be simpler just to enter and look up a full-form word.
Paradigms serve no useful purpose at this location. They belong more to
grammar (or at least morphology) than lexicography. (It is true that a Latin
dictionary will show mensa, -ae, as an aide-mémoire, but I think that if
unlimited paper had been available in the old days, they would have written
out each form in full, not mens- as a headword, and then -a, -am, -arum, etc.)
Paradigms may be quite useful in languages like Polish or Latvian, with a
highly inflected system, but they are neither use nor ornament in analytic
languages like English (only rudimentary inflections left) or Chinese, or
agglutinative languages (Bantu or American languages).
See (c) below for how to fill the paradigm-shaped gap.
(b) Make a grid the standard format for input.
-----------------------------------------------------------------
Most people are quite familiar with tables, and in fact a dictionary entry is
a squished-up table. (And it cannot be a coincidence that all language
dictionaries are presented in this format.) So if you tell helpers "this
column is for the word, this is for the meaning, this is for the declension,
this is for the gender", it can be easily grasped. At one stroke you have
moved the work from "something that requires technical knowledge" to
"something that I use every day".
A grid can be accessed in a spreadsheet, a database, a table in a word-
processor, or a text-editor (if you separate each column with a tab), so it
requires no specialised software. It does not distract the user with node-
names, or confound him with a missing bracket. (I know there are GUI
interfaces, but (1) they need to be installed, and (2) in my experience, they
are slow to work with.)
The benefit of a grid is that it varies only minimally between languages of
completely different families (again, this cannot be a coincidence) - in other
words, a basic template, extended as necessary, will go a long way (see the
NoDaLiDa paper at http://siarad.org.uk/publications.php for Spanish/Welsh and
English, and it works OK with Swahili too in the verb segmenter).
The drudgery of adding words remains, though: I just added 1,500 new words to
the Spanish and Welsh dictionaries for the autoglosser, and the average time
to tidy, check a printed dictionary, and add was just over a minute per word
(about 29 hours).
However, updating a dictionary in a grid format is trivial.
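For what it's worth, a grid like this is also trivial to process programmatically. A minimal sketch (the column names and sample rows are illustrative, not a proposed standard):

```python
import csv
import io

# One word per row; columns separated by tabs, as in a spreadsheet export.
grid = ("word\tmeaning\tpos\tgender\n"
        "liberación\tliberation\tn\tf\n"
        "abeja\tbee\tn\tf\n")

def read_grid(text):
    """Parse the tab-separated grid into a list of row dictionaries,
    keyed by the header line."""
    return list(csv.DictReader(io.StringIO(text), delimiter='\t'))

rows = read_grid(grid)
```

Each row comes back as a dictionary like {'word': 'liberación', 'meaning': 'liberation', 'pos': 'n', 'gender': 'f'}, which any downstream tool could then convert to whatever compiled format the translator needs.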
(c) Instead of devising an interface to the current format, devise upstream
tools for populating a grid format.
--------------------------------------------------
Paradigms (in this view) are gone, but paradigms are still useful for some
languages, where entering multiple cases for a noun (for example) is rather
tedious. So produce tools to generate common (they don't have to be all-
encompassing) forms based on a couple of column entries. For Latin you might
have lexeme-root (mens), nominative singular (a), genitive singular (ae),
declension (1), and then have a generator that uses those to fill in the other
forms, all of which are added full-form to the dictionary grid. For a Bantu
language you might have the lexeme (mti), and word-class (3), and generate the
plural (4, miti).
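A minimal sketch of such a generator for the Latin example (my own illustration; the endings table is trimmed to the first-declension singular for brevity):

```python
# First-declension singular endings (plural omitted for brevity).
FIRST_DECL_SG = {'nom': 'a', 'voc': 'a', 'acc': 'am',
                 'gen': 'ae', 'dat': 'ae', 'abl': 'a'}

def generate_forms(root, declension):
    """Fill in the case forms for a noun root; each generated form would
    then be stored full-form in the dictionary grid."""
    if declension == 1:
        return {case: root + end for case, end in FIRST_DECL_SG.items()}
    raise NotImplementedError("only declension 1 is sketched here")

forms = generate_forms('mens', 1)  # mensa, mensam, mensae, ...
```

So a helper enters only the root and the declension number, and the generator expands them into the full set of forms for the grid.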
The benefit of this is that it's much easier for helpers to get started with a
few manually-added "common" words in the grid, and then move progressively
towards complete coverage (at the point where adding minimally-differentiated
words becomes more tedious than trying to work out rules for recurrent
changes). For example, depending on source text, the subjunctive and past
historic tenses in French may be relatively low priority.
The generators may also be useful tools for other purposes apart from
Apertium.
(d) Conversely, do trivial stemming as part of the lookup.
---------------------------------------------------------------------------------
Certain recurrent variations don't merit the name of paradigms, but may not
need to be in the dictionary either. These could be handled by minimalist
regexes (though HFST, recently mentioned on this list, might be a candidate
for more heavyweight work).
For instance, I think over 85% of the verbforms in the Apertium Spanish
dictionary are forms with clitic pronouns, which really don't need to be there
(so I've taken them out). Most English verbforms (walks, walked, walking)
don't need to be in the English dictionary (though I have more to do on that).
In morphemically fairly regular languages like Spanish or Italian, a word
ending in -a (eg a feminine adjective) or -ito (a diminutive) that does not
appear in the dictionary can have the ending switched to -o to see if anything
like that is in the dictionary, and so on.
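A sketch of what such a minimalist fallback lookup could look like (the rule list is an invented illustration, not a proposal for real Spanish morphology):

```python
# Ending swaps to try, in order, when a form is not in the dictionary.
FALLBACK_RULES = [
    ('ito', 'o'),  # diminutive: librito -> libro
    ('ita', 'a'),
    ('a', 'o'),    # feminine adjective: roja -> rojo
]

def lookup(form, dictionary):
    """Return the entry for `form`, trying ending swaps as a fallback;
    None if nothing matches."""
    if form in dictionary:
        return dictionary[form]
    for old, new in FALLBACK_RULES:
        if form.endswith(old):
            candidate = form[:-len(old)] + new
            if candidate in dictionary:
                return dictionary[candidate]
    return None

d = {'rojo': 'adj/red', 'libro': 'n/book'}
```

With this toy dictionary, lookup('roja', d) falls back to "rojo" and lookup('librito', d) to "libro", so neither derived form needs its own entry.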
Again, the benefit of this is that it can be progressively applied as time or
requirement permits - it's not something that has to be done all at once at
the beginning.
(e) Develop a set of quickstart templates for particular language-types
-----------------------------------------------------------------------------------------------------
It would be worth stepping back to consider what pieces of information about
particular languages need to be recorded in the dictionary, and why. For
large swathes of languages, the required information will be almost identical,
showing only minor differences (if any) between languages in the same family,
and greater differences (though less extensive than might be expected) between
language-groups.
The idea would be to offer helpers a grid template that would be likely to suit
their language, and let them start on that. Inevitably, some additions may be
required, but these could be made organically, and fed back into the template
resources. This would also be a good entrée towards trying to engage
linguists as well as fellow CS/MT people - since Apertium is an RBMT rather
than an SMT system, any input from them will be doubly effective.
For better or worse, that's my tupporth. :-) I think Apertium is a tremendous
resource, not least because of the collection of data that the project has
amassed. With Google now beginning to charge for its translator, Apertium is
probably best-placed to become THE open translator of choice, though of course
there's a distance to go yet.
--
Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03071 Alacant, Spain
Phone: +34 96 590 9776
Fax: +34 96 590 9326
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff