Dear Kevin, dear Apertiumers:
as Francis said, thanks for such a long and detailed answer. It took me
a while to read it, and I don't think I'll comment on everything you
said, especially since there are things I don't really understand well
and should ask about. But here are my first thoughts after a first read.
*A bit of history*: You know how the Apertium dictionary format was
created. It is, basically, the XML version of the old dictionaries we
used in interNOSTRUM.com and traductor.universia.net to translate
between Spanish and Catalan and Spanish and Portuguese, an ad-hoc format
that was invented to build an MT system in one year back in 1999. The
ideas that shaped the design of those dictionaries (and Apertium 1)
are (a) 12 years old and (b) strongly influenced by how we did things
back then for very similar languages. I was only starting to do machine
translation then. Analogously, t1x is just a rewriting and updating of
the "morphtrans" language used there for transfer.
The result is a format, which we now call "dix", that happened to be
much more powerful than what was needed for those inter-Romance
translation tasks, but not powerful enough to deal with other (for
instance, non-catenative) morphologies.
*A [limited] success*: But the main result, despite the difficulties,
is a success: many language pairs have been built for Apertium. People
learn the format, and (painstakingly) write dictionaries, even if, as
Fran says, XML is not meant for humans to deal with on a day-to-day
basis. Maintenance is difficult, dictionaries are different from one
language to another, as "dix" is (ironically) too powerful a formalism.
Its limited power to deal with other languages has been stretched to the
limit to build dictionaries (some of which are, granted, unreadable and
hard to maintain). Metadix formats have been designed to deal with
stem-vowel alternations, accents, and whatnot. This is proof that we do
need a general, more abstract format from which dixes could be created.
We just haven't found a good way to do it.
*A 12-year curse*: Good old interNOSTRUM had its GUI to introduce new
words. And early in Apertium, Fran also set up a webpage to add entries.
They used "some selected inflected forms" to help the user decide on the
paradigm. But they were not general enough. GUIs have been tried, as
well as a database format. I don't have to tell you they didn't succeed:
we would have a tool if they had! Currently a student, under the
supervision of Juan Antonio and me, is trying to revive the idea of
"easy dictionary maintenance" that didn't work either as a GSoC project
I mentored in 2010. The idea is to separate those parts of the
dictionaries that can be dealt with using something like a "grid"
(spreadsheet, database, you name it) and to make it possible to add
"regular", single-word entries to dictionaries (with a config file that
would be unique to each language pair and be distributed with it).
We'll see how it goes. It's a 12-year-old curse we have and we are
trying hard to shake it off.
*Full forms and morpheme boundaries*: As Jim said, there is no
impediment to adding full-form entries (such as the ones used by
Freeling). Also, the fact that current dictionaries do not segment
words at morpheme boundaries is just because people who chose to write
them didn't think that way. Also, I suspect that the idea of "morpheme
boundary" may not be a trivial one. I know because initially I insisted
(in interNOSTRUM) that they should try their best to have
linguistically-inspired segmentations, and I remember that it was not
trivial at all, even for Spanish and Catalan.
*Paradigms and models*: Maybe my thinking is spoiled by the fact that I
have always thought of monolingual dictionaries as morphological
dictionaries, and therefore, I don't see an easy way to do away with
some reference to paradigms. Also, abstraction was probably not the
main goal of the original design, but rather feasibility (a translator
in one year).
I understand that for most words, it would probably be enough to say how
they inflect, and in fact, many entries like
<e lm="liberación"><i>liberaci</i><par n="acci/ón__n"/></e>
basically say so, and could be more abstractly entered as:
<e>liberación<par n="acci/ón__n"/></e>
which just specifies that the word "liberación" inflects as "acción"...
oh well... it does specify other things such as the fact that liberación
is a (feminine) noun as "acción" is (this info comes inside the
paradigm). And yes, for a big part of the dictionary, entries like
these could work, and be easily converted to the current dix format for
compiling. This could be a nice addition to a metadix format. And they
could separate morphology too.
We could also do away with XML and have something like
liberación as acción
liberalidad as accesibilidad
lágrima as abeja
and some intelligent way to process what to do to "liberación" to create
all its full forms and their lexical forms. In fact, one could have
"liberación as acción" and later on "retroferbulación as liberación",
and the compiler could resolve the indirect reference to generate
<e lm="retroferbulación"><i>retroferbulaci</i><par n="acci/ón__n"/></e>
But to be able to do that, a certain "paradigm" apparatus should already
be present, and also the whole "irregular" part of the language...
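To make the idea concrete, here is a rough sketch (entirely my own, not Apertium code) of how such "word as model" declarations could be compiled into dix entries, including the indirect case. The seed table mapping model lemmas to their paradigms and invariable stems is an invented assumption standing in for the "paradigm apparatus" that would already have to exist:

```python
# Seed: model lemma -> (paradigm name, invariable stem of the model).
# This stands in for the pre-existing paradigm apparatus.
SEED = {"acción": ("acci/ón__n", "acci")}

def compile_entries(declarations):
    """Resolve 'word as model' pairs into dix <e> entries, allowing a
    model to be a word declared earlier (an indirect reference)."""
    models = dict(SEED)
    entries = []
    for word, model in declarations:
        paradigm, model_stem = models[model]
        # the part of the model lemma that the paradigm inflects, e.g. "ón"
        suffix_len = len(model) - len(model_stem)
        stem = word[:-suffix_len]
        entries.append('<e lm="%s"><i>%s</i><par n="%s"/></e>'
                       % (word, stem, paradigm))
        models[word] = (paradigm, stem)  # later entries may say "... as word"
    return entries

for e in compile_entries([("liberación", "acción"),
                          ("retroferbulación", "liberación")]):
    print(e)
```

Under these assumptions, the two declarations produce exactly the two full dix entries shown earlier, with the indirect "retroferbulación as liberación" resolved through the entry for "liberación".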
How could we do the same for bilingual dictionaries?
For those of you with a programming background, "dix" is like
programming in assembler. We need a high-level language, but since dix
compilers are fast and well written, perhaps "dix" may still have a
place as an intermediate format.
*A note on paradigms and speed*: Using paradigms speeds compiling _a
lot_. Sergio Ortiz, who completely redesigned the dictionary compiler
for Apertium, can tell you why.
*On enclitics in the dictionary*: don't forget that Apertium does
"tokenize as you analyse".
Somewhere, if not in the dix, you should have a way to deal with the
way in which Italian, Spanish or Portuguese handle the orthography of
verbs with enclitic pronouns.
/(es) démonoslos → demos + nos + los
(es) bésame → besa + me
(es) mátalo → mata + lo/
One way would be to have some kind of pattern BEFORE the dictionaries to
deal with that:
/-émonoslos → -emos + nos + los
-é[C]ame → -e[C]a + me   # [C] is any valid consonant cluster, defined in advance
-á[C]alo → -a[C]a + lo/
These "suffix" transformations could be dealt with using "dix" regular
expressions and some kind of "suffix processing", not too different from
postgeneration, in fact. Many of these things could be done with minor
changes to the current ways in which we deal with Apertium, with new
modes...
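As a toy illustration (my own sketch, not an existing Apertium module), suffix patterns like those above could be tried with plain regular expressions before dictionary lookup; the pronoun list and the consonant class standing in for [C] are crude assumptions:

```python
import re
import unicodedata

PRONOUNS = 'me|te|se|nos|os|lo|la|los|las'

def deaccent(ch):
    """Strip the accent from a single character: á -> a, é -> e."""
    return ''.join(c for c in unicodedata.normalize('NFD', ch)
                   if unicodedata.category(c) != 'Mn')

def split_enclitics(form):
    """Peel enclitic pronouns off a Spanish verb form when a pattern
    applies; otherwise return the form unchanged."""
    # -émonoslos -> -emos + nos + los   (démonoslos -> demos+nos+los)
    m = re.match(r'^(.+)émonos(%s)$' % PRONOUNS, form)
    if m:
        return m.group(1) + 'emos+nos+' + m.group(2)
    # -é[C]ame -> -e[C]a + me ; -á[C]alo -> -a[C]a + lo
    # [C] approximated here as a run of common consonants
    m = re.match(r'^(.*[áé])([bcdfglmnprst]+)a(%s)$' % PRONOUNS, form)
    if m:
        stem = m.group(1)[:-1] + deaccent(m.group(1)[-1])
        return stem + m.group(2) + 'a+' + m.group(3)
    return form
```

With these two rules, "démonoslos" comes out as "demos+nos+los", "bésame" as "besa+me" and "mátalo" as "mata+lo", while forms that match no pattern pass through untouched.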
I think this is an interesting discussion and that we should have some
kind of "rethinking apertium" project or conference or un-conference.
Many of these limitations, such as the ones Kevin pointed out, are
clearly hindering the expansion of Apertium.
Cheers
Mikel
On 11/12/2011 10:15 PM, Kevin Donnelly wrote:
Hi Mikel
On Saturday 12 November 2011 Mikel Forcada said:
On 11/12/2011 10:31 AM, Kevin Donnelly wrote:
FWIW, I think the fundamental problem is that the format of the
dictionaries is non-optimal from a linguistic point of view.
Kevin, it would be good to hear a bit more detail about how it would be
improved, as at some point we should revive the process towards
unification and standardization of metadix.
Caveat - I am not a CS specialist, and it's a couple of years since I worked
directly on the Apertium format, so my memory may be hazy, or things may have
changed. :-)
Also, sorry for the length ....
Current situation (as I see it)
------------------------------------------
I think there is a bundle of issues which combine to make things non-optimal:
-- the format requires words to be segmented;
-- the segmentation boundaries in the dictionary don't necessarily align with morpheme boundaries;
-- many words are handled indirectly via paradigms.
The result is that expanding the dictionaries is actually quite involved - you
examine the word-list, decide on paradigms and code them up, assign the words
to paradigms, code up those that fit the paradigms, and code individually any
words that do not fit into a paradigm.
However, you can't necessarily use the paradigms you find in grammar-books,
because the format uses orthographic instead of morphemic boundaries, so you
may have to refactor the paradigms first, which is a non-trivial task in my
experience.
There is also the point that most speakers do not think in terms of paradigms
anyway - they just "know" that a particular form "sounds" right - and working
out the finer points of inflected tenses or locatives for rarely-used words is
often not a trivial exercise either.
*Some* of this can be simplified via scripting - but then you need to be able
to script, and in my experience the number of linguists (let alone interested
members of the public) in any given location who can use regular expressions
(never mind scripting!) can be counted on the fingers of half a hand.
I also found updating the dictionaries even more difficult than creating them,
but that's just a personal view based on my loathing of XML, and I accept that
others probably find it simplicity itself. :-)
So, what to do? I would suggest a few things to make dictionary maintenance
an order of magnitude easier:
(a) Remove paradigms from the dictionary.
-------------------------------------------------------------
In effect, you are splitting words artificially (not along linguistically-
accepted lines) on input, so that you can put them back together again at
lookup. It would be simpler just to enter and look up a full-form word.
Paradigms serve no useful purpose at this location. They belong more to
grammar (or at least morphology) than lexicography. (It is true that a Latin
dictionary will show mensa, -ae, as an aide-mémoire, but I think that if
unlimited paper had been available in the old days, they would have written
out each form in full, not mens- as a headword, and then -a, -am, -arum, etc.)
Paradigms may be quite useful in languages like Polish or Latvian, with a
highly inflected system, but they are neither use nor ornament in analytic
languages like English (only rudimentary inflections left) or Chinese, or
agglutinative languages (Bantu or American languages).
See (c) below for how to fill the paradigm-shaped gap.
(b) Make a grid the standard format for input.
-----------------------------------------------------------------
Most people are quite familiar with tables, and in fact a dictionary entry is
a squished-up table. (And it cannot be a coincidence that all language
dictionaries are presented in this format.) So if you tell helpers "this
column is for the word, this is for the meaning, this is for the declension,
this is for the gender", it can be easily grasped. At one stroke you have
moved the work from "something that requires technical knowledge" to
"something that I use every day".
A grid can be accessed in a spreadsheet, a database, a table in a word-
processor, or a text-editor (if you separate each column with a tab), so it
requires no specialised software. It does not distract the user with node-
names, or confound him with a missing bracket. (I know there are GUI
interfaces, but (1) they need to be installed, and (2) in my experience, they
are slow to work with.)
The benefit of a grid is that it varies only minimally between languages of
completely different families (again, this cannot be a coincidence) - in other
words, a basic template, extended as necessary, will go a long way (see the
NoDaLiDa paper at http://siarad.org.uk/publications.php for Spanish/Welsh and
English, and it works OK with Swahili too in the verb segmenter).
The drudgery of adding words remains, though: I just added 1,500 new words to
the Spanish and Welsh dictionaries for the autoglosser, and the average time
to tidy, check a printed dictionary, and add was just over a minute per word
(about 29 hours).
However, updating a dictionary in a grid format is trivial.
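For what it's worth, a grid like this is also trivial to process programmatically. A minimal sketch (the column names and sample rows are illustrative, not a proposed standard):

```python
import csv
import io

# One word per row; columns separated by tabs, as in a spreadsheet export.
grid = ("word\tmeaning\tpos\tgender\n"
        "liberación\tliberation\tn\tf\n"
        "abeja\tbee\tn\tf\n")

def read_grid(text):
    """Parse the tab-separated grid into a list of row dictionaries,
    keyed by the header line."""
    return list(csv.DictReader(io.StringIO(text), delimiter='\t'))

rows = read_grid(grid)
```

Each row comes back as a dictionary like {'word': 'liberación', 'meaning': 'liberation', 'pos': 'n', 'gender': 'f'}, which any downstream tool could then convert to whatever compiled format the translator needs.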
(c) Instead of devising an interface to the current format, devise upstream
tools for populating a grid format.
--------------------------------------------------
Paradigms (in this view) are gone, but paradigms are still useful for some
languages, where entering multiple cases for a noun (for example) is rather
tedious. So produce tools to generate common (they don't have to be all-
encompassing) forms based on a couple of column entries. For Latin you might
have lexeme-root (mens), nominative singular (a), genitive singular (ae),
declension (1), and then have a generator that uses those to fill in the other
forms, all of which are added full-form to the dictionary grid. For a Bantu
language you might have the lexeme (mti), and word-class (3), and generate the
plural (4, miti).
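A minimal sketch of such a generator for the Latin example (my own illustration; the endings table is trimmed to the first-declension singular for brevity):

```python
# First-declension singular endings (plural omitted for brevity).
FIRST_DECL_SG = {'nom': 'a', 'voc': 'a', 'acc': 'am',
                 'gen': 'ae', 'dat': 'ae', 'abl': 'a'}

def generate_forms(root, declension):
    """Fill in the case forms for a noun root; each generated form would
    then be stored full-form in the dictionary grid."""
    if declension == 1:
        return {case: root + end for case, end in FIRST_DECL_SG.items()}
    raise NotImplementedError("only declension 1 is sketched here")

forms = generate_forms('mens', 1)  # mensa, mensam, mensae, ...
```

So a helper enters only the root and the declension number, and the generator expands them into the full set of forms for the grid.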
The benefit of this is that it's much easier for helpers to get started with a
few manually-added "common" words in the grid, and then move progressively
towards complete coverage (at the point where adding minimally-differentiated
words becomes more tedious than trying to work out rules for recurrent
changes). For example, depending on source text, the subjunctive and past
historic tenses in French may be relatively low priority.
The generators may also be useful tools for other purposes apart from
Apertium.
(d) Conversely, do trivial stemming as part of the lookup.
---------------------------------------------------------------------------------
Certain recurrent variations don't merit the name of paradigms, but may not
need to be in the dictionary either. These could be handled by minimalist
regexes (though HFST, recently mentioned on this list, might be a candidate
for more heavyweight work).
For instance, I think over 85% of the verbforms in the Apertium Spanish
dictionary are forms with clitic pronouns, which really don't need to be there
(so I've taken them out). Most English verbforms (walks, walked, walking)
don't need to be in the English dictionary (though I have more to do on that).
In morphemically fairly regular languages like Spanish or Italian, a word
ending in -a (eg a feminine adjective) or -ito (a diminutive) that does not
appear in the dictionary can have the ending switched to -o to see if anything
like that is in the dictionary, and so on.
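A sketch of what such a minimalist fallback lookup could look like (the rule list is an invented illustration, not a proposal for real Spanish morphology):

```python
# Ending swaps to try, in order, when a form is not in the dictionary.
FALLBACK_RULES = [
    ('ito', 'o'),  # diminutive: librito -> libro
    ('ita', 'a'),
    ('a', 'o'),    # feminine adjective: roja -> rojo
]

def lookup(form, dictionary):
    """Return the entry for `form`, trying ending swaps as a fallback;
    None if nothing matches."""
    if form in dictionary:
        return dictionary[form]
    for old, new in FALLBACK_RULES:
        if form.endswith(old):
            candidate = form[:-len(old)] + new
            if candidate in dictionary:
                return dictionary[candidate]
    return None

d = {'rojo': 'adj/red', 'libro': 'n/book'}
```

With this toy dictionary, lookup('roja', d) falls back to "rojo" and lookup('librito', d) to "libro", so neither derived form needs its own entry.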
Again, the benefit of this is that it can be progressively applied as time or
requirement permits - it's not something that has to be done all at once at
the beginning.
(e) Develop a set of quickstart templates for particular language-types
-----------------------------------------------------------------------------------------------------
It would be worth stepping back to consider what pieces of information about
particular languages need to be recorded in the dictionary, and why. For
large swathes of languages, the required information will be almost identical,
showing only minor differences (if any) between languages in the same family,
and greater differences (though less extensive than might be expected) between
language-groups.
The idea would be to offer helpers a grid template that would be likely to suit
their language, and let them start on that. Inevitably, some additions may be
required, but these could be made organically, and fed back into the template
resources. This would also be a good entrée towards trying to engage
linguists as well as fellow CS/MT people - since Apertium is an RBMT rather
than an SMT system, any input from them will be doubly effective.
For better or worse, that's my tupporth. :-) I think Apertium is a tremendous
resource, not least because of the collection of data that the project has
amassed. With Google now beginning to charge for its translator, Apertium is
probably best-placed to become THE open translator of choice, though of course
there's a distance to go yet.
--
Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03071 Alacant, Spain
Phone: +34 96 590 9776
Fax: +34 96 590 9326
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff