On Sat, 12 Nov 2011 at 21:15 +0000, Kevin Donnelly wrote:
> Hi Mikel
>
> On Saturday 12 November 2011 Mikel Forcada said:
> > On 11/12/2011 10:31 AM, Kevin Donnelly wrote:
> > > FWIW, I think the fundamental problem is that the format of the dictionaries is non-optimal from a linguistic point of view.
> >
> > Kevin, it would be good to hear a bit more detail about how it would be improved, as at some point we should revive the process towards unification and standardization of metadix.
>
> Caveat - I am not a CS specialist, and it's a couple of years since I worked directly on the Apertium format, so my memory may be hazy, or things may have changed. :-)
>
> Also, sorry for the length ....
On the contrary, thanks for a long and thoughtful email! Lots of things to think about :)

> Current situation (as I see it)
> ------------------------------------------
>
> I think there is a bundle of issues which combine to make things non-optimal:
> -- the format requires words to be segmented;
> -- the dictionary boundary doesn't necessarily align with morpheme boundaries;

This is the most problematic part as I see it. It makes dealing with languages with stem-internal changes quite a nuisance, unless you're just converting from an existing resource - e.g. for Dutch, we used Wiktionary.

> -- many words are handled indirectly via paradigms.
>
> The result is that expanding the dictionaries is actually quite involved - you examine the word-list, decide on paradigms and code them up, assign the words to paradigms, code up those that fit the paradigms, and code individually any words that do not fit into a paradigm.
>
> However, you can't necessarily use the paradigms you find in grammar-books, because the format uses orthographic instead of morphemic boundaries, so you may have to refactor the paradigms first, which is a non-trivial task in my experience.

Agree.

> There is also the point that most speakers do not think in terms of paradigms anyway - they just "know" that a particular form "sounds" right - and working out the finer points of inflected tenses or locatives for rarely-used words is often not a trivial exercise either.
>
> *Some* of this can be simplified via scripting - but then you need to be able to script, and in my experience the number of linguists (let alone interested members of the public) in any given location who can use regular expressions (never mind scripting!) can be counted on the fingers of half a hand.

True.

> I also found updating the dictionaries even more difficult than creating them, but that's just a personal view based on my loathing of XML, and I accept that others probably find it simplicity itself. :-)

I find it to be the case too, partly because as you code dictionaries they acquire their own idiosyncrasies - hacks which solve a problem but aren't necessarily elegant. A new dictionary, without any hacks, is elegant, but then it often still has many problems.

> So, what to do? I would suggest a few things to make dictionary maintenance an order of magnitude easier:
>
> (a) Remove paradigms from the dictionary.
> -------------------------------------------------------------
>
> In effect, you are splitting words artificially (not along linguistically-accepted lines) on input, so that you can put them back together again at lookup. It would be simpler just to enter and look up a full-form word.
>
> Paradigms serve no useful purpose at this location. They belong more to grammar (or at least morphology) than lexicography. (It is true that a Latin dictionary will show mensa, -ae, as an aide-mémoire, but I think that if unlimited paper had been available in the old days, they would have written out each form in full, not mens- as a headword, and then -a, -am, -arum, etc.)
>
> Paradigms may be quite useful in languages like Polish or Latvian, with a highly inflected system, but they are neither use nor ornament in analytic languages like English (only rudimentary inflections left) or Chinese, or agglutinative languages (Bantu or American languages).

I think paradigms are useful for languages ranging in inflection from Catalan to Polish. For Finnish, Basque, Turkish, etc. they're not so useful, as you can basically decide the inflection based on word category and then do the phonology separately - but then the word category basically becomes the "paradigm". For English, I don't think it really hurts that much: for regular nouns you have one paradigm, for regular verbs another. The names could probably be clearer (e.g. "regular noun" instead of "house__n"), but that could equally be a comment.
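Just to make the comparison concrete: a paradigm is really only a compressed full-form list, so the two views in (a) are interconvertible. Here is a toy Python sketch of the expansion - the tag names and the contents of the paradigm are made up for illustration, not taken from any real dictionary:

    # A "regular noun" paradigm as (suffix, tags) pairs -- invented example.
    REGULAR_NOUN = [('', ['n', 'sg']), ('s', ['n', 'pl'])]

    def expand(lemma, paradigm):
        """Yield (surface form, analysis) pairs for one lemma."""
        for suffix, tags in paradigm:
            analysis = lemma + ''.join('<%s>' % tag for tag in tags)
            yield lemma + suffix, analysis

    for form, analysis in expand('house', REGULAR_NOUN):
        print(form + '\t' + analysis)
    # house    house<n><sg>
    # houses   house<n><pl>

Which of the two representations people actually edit is a separate question from what the lookup machinery uses internally.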
> See (c) below for how to fill the paradigm-shaped gap.
>
> (b) Make a grid the standard format for input.
> -----------------------------------------------------------------
>
> Most people are quite familiar with tables, and in fact a dictionary entry is a squished-up table. (And it cannot be a coincidence that all language dictionaries are presented in this format.) So if you tell helpers "this column is for the word, this is for the meaning, this is for the declension, this is for the gender", it can be easily grasped. At one stroke you have moved the work from "something that requires technical knowledge" to "something that I use every day".

Yes.

> A grid can be accessed in a spreadsheet, a database, a table in a word-processor, or a text-editor (if you separate each column with a tab), so it requires no specialised software. It does not distract the user with node-names, or confound him with a missing bracket. (I know there are GUI interfaces, but (1) they need to be installed, and (2) in my experience, they are slow to work with.)

Agree, the current GUIs are more hassle than they are worth.

> The benefit of a grid is that it varies only slightly between languages of completely different families (again, this cannot be a coincidence) - in other words, a basic template, extended as necessary, will go a long way (see the NoDaLiDa paper at http://siarad.org.uk/publications.php for Spanish/Welsh and English, and it works OK with Swahili too in the verb segmenter).
>
> The drudgery of adding words remains, though: I just added 1,500 new words to the Spanish and Welsh dictionaries for the autoglosser, and the average time to tidy, check a printed dictionary, and add was just over a minute per word (about 29 hours).

Once you are up to speed with lttoolbox, the time per word is about the same, or less. For the bilingual dictionary, it is basically a matter of copy/paste/edit.

> However, updating a dictionary in a grid format is trivial.
>
> (c) Instead of devising an interface to the current format, devise upstream tools for populating a grid format.
> --------------------------------------------------

What I think here is that we can make the grid format a way of coding data that will later be converted into lttoolbox format. Really, no-one should be editing XML; it should just be an intermediate layer between the grid and the binary format.

By the way, the idea of a grid format is pretty common in grammars, and also when trying to explain or elicit forms:

http://ilazki.thinkgeek.co.uk/~spectre/chuvash_table.jpg
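Roughly what I have in mind for the grid-to-XML step is something like the sketch below. The column layout, the category names and the paradigm names are placeholders for illustration, and a real monodix would of course also need the matching <pardef>s and <sdef>s:

    import sys

    # Map a human-friendly category in the grid to an lttoolbox paradigm name.
    # These names are placeholders, not taken from any existing dictionary.
    PARADIGMS = {
        'regular noun': 'house__n',
        'regular verb': 'accept__vblex',
    }

    def grid_to_entries(rows):
        """Turn tab-separated (lemma, category) rows into monodix <e> elements."""
        for row in rows:
            row = row.strip()
            if not row or row.startswith('#'):
                continue
            lemma, category = row.split('\t')
            yield '<e lm="{0}"><i>{0}</i><par n="{1}"/></e>'.format(
                lemma, PARADIGMS[category])

    if __name__ == '__main__':
        # e.g.  printf 'cat\tregular noun\n' | python grid2dix.py
        for entry in grid_to_entries(sys.stdin):
            print(entry)

The grid would be the thing people actually edit (and diff, and check), with the XML and then the binary generated from it.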
> Paradigms (in this view) are gone, but paradigms are still useful for some languages, where entering multiple cases for a noun (for example) is rather tedious. So produce tools to generate common (they don't have to be all-encompassing) forms based on a couple of column entries. For Latin you might have lexeme-root (mens), nominative singular (a), genitive singular (ae), declension (1), and then have a generator that uses those to fill in the other forms, all of which are added full-form to the dictionary grid. For a Bantu language you might have the lexeme (mti), and word-class (3), and generate the plural (4, miti).

This is basically how the Bengali analyser works, and also the system for generating forms of Maltese verbs. See for example:

    {'stem': 'xaqq', 'type': 'doubled', 'gloss': 'crack', 'root': 'x-q-q', 'vowel_perf': 'a-a', 'vowel_impf': 'i-o', 'trans': 'tv', 'pp': 'mi'},

https://apertium.svn.sourceforge.net/svnroot/apertium/staging/apertium-mt-he/dev/verb.py

> The benefit of this is that it's much easier for helpers to get started with a few manually-added "common" words in the grid, and then move progressively towards complete coverage (at the point where adding minimally-differentiated words becomes more tedious than trying to work out rules for recurrent changes). For example, depending on source text, the subjunctive and past historic tenses in French may be relatively low priority.
>
> The generators may also be useful tools for other purposes apart from Apertium.
>
> (d) Conversely, do trivial stemming as part of the lookup.
> ---------------------------------------------------------------------------------
>
> Certain recurrent variations don't merit the name of paradigms, but may not need to be in the dictionary either. These could be handled by minimalist regexes (though HFST, recently mentioned on this list, might be a candidate for more heavyweight work).
>
> For instance, I think over 85% of the verbforms in the Apertium Spanish dictionary are forms with clitic pronouns, which really don't need to be there (so I've taken them out). Most English verbforms (walks, walked, walking) don't need to be in the English dictionary (though I have more to do on that). In morphemically fairly regular languages like Spanish or Italian, a word ending in -a (eg a feminine adjective) or -ito (a diminutive) that does not appear in the dictionary can have the ending switched to -o to see if anything like that is in the dictionary, and so on.
>
> Again, the benefit of this is that it can be progressively applied as time or requirement permits - it's not something that has to be done all at once at the beginning.

I'm more or less against this for MT systems. For morphological analysis in general I think it is a good idea, but for an MT system you then have to figure out the translation, which may not be regular or trivial.
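For what it's worth, the kind of minimalist fallback (d) describes would look something like the sketch below on the lookup side - the suffix pairs and the in_dictionary() check are invented for illustration. Note that it only recovers a plausible headword; it doesn't tell you how to translate the clitic or diminutive you stripped off, which is my worry above.

    # Toy fallback lookup: try a few cheap orthographic substitutions before
    # giving up on an unknown word.  The suffix pairs are illustrative only.
    FALLBACKS = [
        ('ita', 'a'),   # diminutive: casita -> casa
        ('ito', 'o'),   # diminutive: librito -> libro
        ('a',   'o'),   # feminine adjective: roja -> rojo
    ]

    def lookup_with_fallback(word, in_dictionary):
        """Return a headword for `word`, or None if nothing matches."""
        if in_dictionary(word):
            return word
        for old, new in FALLBACKS:
            if word.endswith(old):
                candidate = word[:-len(old)] + new
                if in_dictionary(candidate):
                    return candidate
        return None

    # e.g. lookup_with_fallback('roja', {'rojo', 'casa'}.__contains__) -> 'rojo'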
> (e) Develop a set of quickstart templates for particular language-types
> ---------------------------------------------------------------------------
>
> It would be worth stepping back to consider what pieces of information about particular languages need to be recorded in the dictionary, and why. For large swathes of languages, the required information will be almost identical, showing only minor differences (if any) between languages in the same family, and greater differences (though less extensive than might be expected) between language-groups.

This is a great idea, and something like what we try to do with the incubator.

> The idea would be to offer helpers a grid template that would be likely to suit their language, and let them start on that. Inevitably, some additions may be required, but these could be made organically, and fed back into the template resources. This would also be a good entrée towards trying to engage linguists as well as fellow CS/MT people - since Apertium is an RBMT rather than an SMT system, any input from them will be doubly effective.

I had an idea similar to this, using some Wiki software. If you look at a lot of Wiktionaries, they have a similar layout, with inflections in tables, and templates to reuse similar inflection classes. I think this would be most useful for open classes, and it's fairly straightforward to get the hang of. Words could be classified not only by templates for inflection, but also by categories for other features.

Also, see for example in the Apertium Wiki:

http://wiki.apertium.org/wiki/Category:%D0%A1%D3%91%D0%BC%D0%B0%D1%85%D1%81%D0%B0%D1%80

This is generated automatically from a set of templates and a full-form list.

Another thing to think about: for the bilingual dictionary work, I think that separating out the "default translations" from the main dictionary could make a lot of sense. It would allow "normal people" to add possible translations without having to think about how they affect other entries (e.g. working out the direction restrictions).

My response may have been rambling, so I'll try to sum it up:

* Grids/tables are good.
* I don't think it is necessary to get rid of the XML format, just to abstract away from it.
* I think that to start with, it might be better to make a grid interface for creating a _new_ language pair, rather than for editing existing language pairs.
** Existing language pairs have a lot of cruft, which makes developing an interface difficult.

Fran
