Hi Mikel

On Saturday 12 November 2011 Mikel Forcada said:
> On 11/12/2011 10:31 AM, Kevin Donnelly wrote:
> > FWIW, I think the fundamental problem is that the format of the
> > dictionaries is non-optimal from a linguistic point of view.
>
> Kevin, it would be good to hear a bit more detail about how it would be
> improved, as at some point we should revive the process towards
> unification and standardization of metadix.
Caveat - I am not a CS specialist, and it's a couple of years since I worked directly on the Apertium format, so my memory may be hazy, or things may have changed. :-)  Also, sorry for the length ....

Current situation (as I see it)
------------------------------------------

I think there is a bundle of issues which combine to make things non-optimal:

-- the format requires words to be segmented;
-- the segmentation boundary doesn't necessarily align with morpheme boundaries;
-- many words are handled indirectly via paradigms.

The result is that expanding the dictionaries is actually quite involved - you examine the word-list, decide on paradigms and code them up, assign the words to paradigms, code up those that fit the paradigms, and code individually any words that do not fit into a paradigm. However, you can't necessarily use the paradigms you find in grammar-books, because the format uses orthographic rather than morphemic boundaries, so you may have to refactor the paradigms first, which is a non-trivial task in my experience. There is also the point that most speakers do not think in terms of paradigms anyway - they just "know" that a particular form "sounds" right - and working out the finer points of inflected tenses or locatives for rarely-used words is often not a trivial exercise either.

*Some* of this can be simplified via scripting - but then you need to be able to script, and in my experience the number of linguists (let alone interested members of the public) in any given location who can use regular expressions (never mind scripting!) can be counted on the fingers of half a hand. I also found updating the dictionaries even more difficult than creating them, but that's just a personal view based on my loathing of XML, and I accept that others probably find it simplicity itself. :-)

So, what to do? I would suggest a few things to make dictionary maintenance an order of magnitude easier:

(a) Remove paradigms from the dictionary.
-------------------------------------------------------------

In effect, you are splitting words artificially (not along linguistically-accepted lines) on input, so that you can put them back together again at lookup. It would be simpler just to enter and look up a full-form word. Paradigms serve no useful purpose here - they belong more to grammar (or at least morphology) than to lexicography. (It is true that a Latin dictionary will show mensa, -ae, as an aide-mémoire, but I think that if unlimited paper had been available in the old days, they would have written out each form in full, not mens- as a headword and then -a, -am, -arum, etc.) Paradigms may be quite useful in languages like Polish or Latvian, with a highly inflected system, but they are neither use nor ornament in analytic languages like English (with only rudimentary inflections left) or Chinese, or in agglutinative languages (Bantu or American languages). See (c) below for how to fill the paradigm-shaped gap.

(b) Make a grid the standard format for input.
-----------------------------------------------------------------

Most people are quite familiar with tables, and in fact a dictionary entry is a squished-up table. (And it cannot be a coincidence that all language dictionaries are presented in this format.) So if you tell helpers "this column is for the word, this is for the meaning, this is for the declension, this is for the gender", it can be easily grasped. At one stroke you have moved the work from "something that requires technical knowledge" to "something that I use every day".
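To make that concrete, a few rows of such a grid might look something like this (the columns and tags here are purely illustrative - a real template would carry whatever the language in question needs):

    surface     lemma     POS   gender  number  English
    casa        casa      n     f       sg      house
    casas       casa      n     f       pl      houses
    blanca      blanco    adj   f       sg      white
    cantamos    cantar    v     -       pl      we sing

Each row is one full surface form, which is what (a) above implies - no stems, no paradigm references, just a complete word and its properties.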
A grid can be accessed in a spreadsheet, a database, a table in a word-processor, or a text-editor (if you separate each column with a tab), so it requires no specialised software. It does not distract the user with node-names, or confound him with a missing bracket. (I know there are GUI interfaces, but (1) they need to be installed, and (2) in my experience they are slow to work with.) The benefit of a grid is that it varies only slightly between languages of completely different families (again, this cannot be a coincidence) - in other words, a basic template, extended as necessary, will go a long way (see the NoDaLiDa paper at http://siarad.org.uk/publications.php for Spanish/Welsh and English, and it works OK with Swahili too in the verb segmenter). The drudgery of adding words remains, though: I just added 1,500 new words to the Spanish and Welsh dictionaries for the autoglosser, and the average time to tidy an entry, check a printed dictionary, and add it was just over a minute per word (about 29 hours in all). However, updating a dictionary in a grid format is trivial.

(c) Instead of devising an interface to the current format, devise upstream tools for populating a grid format.
--------------------------------------------------

Paradigms (in this view) are gone from the dictionary, but they are still useful for some languages, where entering multiple cases for a noun (for example) is rather tedious. So produce tools to generate common (they don't have to be all-encompassing) forms based on a couple of column entries. For Latin you might have lexeme-root (mens), nominative singular (a), genitive singular (ae), declension (1), and then have a generator that uses those to fill in the other forms, all of which are added full-form to the dictionary grid - there is a rough sketch of this after (d) below. For a Bantu language you might have the lexeme (mti) and word-class (3), and generate the plural (4, miti). The benefit of this is that it's much easier for helpers to get started with a few manually-added "common" words in the grid, and then move progressively towards complete coverage (at the point where adding minimally-differentiated words becomes more tedious than trying to work out rules for recurrent changes). For example, depending on the source text, the subjunctive and past historic tenses in French may be relatively low priority. The generators may also be useful tools for other purposes apart from Apertium.

(d) Conversely, do trivial stemming as part of the lookup.
---------------------------------------------------------------------------------

Certain recurrent variations don't merit the name of paradigms, but may not need to be in the dictionary either. These could be handled by minimalist regexes (though HFST, recently mentioned on this list, might be a candidate for more heavyweight work). For instance, I think over 85% of the verb forms in the Apertium Spanish dictionary are forms with clitic pronouns, which really don't need to be there (so I've taken them out). Most English verb forms (walks, walked, walking) don't need to be in the English dictionary either (though I have more to do on that). In morphemically fairly regular languages like Spanish or Italian, a word ending in -a (e.g. a feminine adjective) or -ito (a diminutive) that does not appear in the dictionary can have the ending switched to -o to see if anything like that is in the dictionary, and so on. Again, the benefit of this is that it can be applied progressively as time or requirement permits - it's not something that has to be done all at once at the beginning.
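Going back to (c) for a moment, here is a very rough sketch (in Python, purely as an illustration - the column names are made up, the endings are the standard first-declension set minus the vocative, and nothing else is covered) of the kind of generator I have in mind:

    # Sketch of a form-generator: takes the root and declension from two
    # grid columns and emits full-form rows ready to paste back into the
    # grid.  Only the Latin first declension is treated here.

    FIRST_DECLENSION = {
        "nom.sg": "a",   "acc.sg": "am",  "gen.sg": "ae",
        "dat.sg": "ae",  "abl.sg": "a",
        "nom.pl": "ae",  "acc.pl": "as",  "gen.pl": "arum",
        "dat.pl": "is",  "abl.pl": "is",
    }

    def generate_forms(root, declension):
        """Return (surface, lemma, tags) triples for one noun."""
        if declension != 1:
            raise NotImplementedError("only the first declension is sketched here")
        lemma = root + FIRST_DECLENSION["nom.sg"]
        return [(root + ending, lemma, tags)
                for tags, ending in FIRST_DECLENSION.items()]

    # e.g. mens- (mensa, "table"), first declension
    for surface, lemma, tags in generate_forms("mens", 1):
        print(surface, lemma, "n", tags, sep="\t")

A similar couple of dozen lines would cover the Bantu noun-class example (mti/miti), and since the output is just more grid rows, nothing downstream needs to know a generator was ever involved.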
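And for (d), a minimal fallback lookup along those lines might look something like this (again just a sketch - the clitic list and the ending swaps are examples, not an exhaustive set):

    import re

    # Try the word as-is first, then strip a few recurrent Spanish variations.
    CLITICS = r"(?:me|te|se|lo|la|le|los|las|les|nos|os)"

    def lookup(word, dictionary):
        if word in dictionary:
            return dictionary[word]
        # verb form with one or two attached clitic pronouns, e.g. darse -> dar
        stripped = re.sub(CLITICS + r"{1,2}$", "", word)
        if stripped != word and stripped in dictionary:
            return dictionary[stripped]
        # feminine adjective or diminutive not listed: switch the ending to -o
        for ending in ("a", "ito"):
            if word.endswith(ending):
                candidate = word[:-len(ending)] + "o"
                if candidate in dictionary:
                    return dictionary[candidate]
        return None

    print(lookup("darse", {"dar": "give"}))       # give
    print(lookup("blanca", {"blanco": "white"}))  # white

The point is that each rule can be added (or removed) independently, so this sort of coverage can grow as the corpus demands, which fits the progressive approach above.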
(e) Develop a set of quickstart templates for particular language-types
-----------------------------------------------------------------------------------------------------

It would be worth stepping back to consider what pieces of information about particular languages need to be recorded in the dictionary, and why. For large swathes of languages, the required information will be almost identical, showing only minor differences (if any) between languages in the same family, and greater differences (though less extensive than might be expected) between language-groups. The idea would be to offer helpers a grid template that would be likely to suit their language, and let them start on that. Inevitably, some additions may be required, but these could be made organically, and fed back into the template resources. This would also be a good entrée towards trying to engage linguists as well as fellow CS/MT people - since Apertium is an RBMT rather than an SMT system, any input from them will be doubly effective.

For better or worse, that's my tupporth. :-)  I think Apertium is a tremendous resource, not least because of the collection of data that the project has amassed. With Google now beginning to charge for its translator, Apertium is probably best-placed to become THE open translator of choice, though of course there's a distance to go yet.

-- 
Pob hwyl / Best wishes

Kevin Donnelly
kevindonnelly.org.uk
