[Apertium-stuff] About de-duplicating of dictionaries

Ilnar Salimzyan Tue, 27 Mar 2012 09:47:46 -0700

This thread grew out of the discussion of my proposal draft [see
"GSoC: Adopting a language pair: Tur-Tat / Kaz-Tat" from March 26].

I have discussed the problem of monodixes/lexc files being copied into
many pairs (and into more and more pairs) with Jonathan, and I see that
people on IRC raise this question quite often (like "Which Tatar lexc
should I choose for my new Tatar-X translator?"), so I decided to
start a new discussion here :)

On Mon, Mar 26, 2012 at 2:37 PM, Kevin Brubeck Unhammer
<[email protected]> wrote:

> It'd be nice to have some general method for deduplicating
> dictionaries

I think we all share the same view.

Obviously, having a single transducer per language, compatible with
the transducers of many related languages, is great. It would
facilitate the creation of new translators.
And I think that keeping them compatible at the tags/morphotactics
level can and should be done.

> … We use a trimming script in apertium-sme-nob; with this
> method, you would have apertium-kaz and apertium-tat as just
> "development dependencies". So you'd add stuff to apertium-kaz/kaz.lexc
> and to your bidix, and then run a script from apertium-kaz-tat with the
> path to apertium-kaz and it creates a file apertium-kaz-tat/kaz.lexc
> (and you never change this file, although it's in SVN). Similarly for
> tat.lexc.
>
> This works, as long as the trimming script is well configured, but
> perhaps it'd be 'cleaner' to have apertium-kaz/apertium-tat as "make
> dependencies" and do the trimming each time you type make (no need for
> apertium-kaz-tat to have generated kaz.lexc/tat.lexc files in SVN).
>
> (The weak point in the chain is the trimming script though, which
> expects the lexc files to be fairly easily parsable (they're not,
> really). Ideally we would have ways of trimming both HFST and lttoolbox
> dictionaries so that we never had to copy-paste anything between pairs,
> but language pairs tend to have stuff in them that's rather specific to
> that pair, not sure how that is best dealt with.)

= Reasons why we have monodixes copied =
1. Historical (initially there weren't many pairs sharing a language,
but Apertium keeps growing);
2. Because of the stuff specific to a given pair.

= Some imaginable solutions =
Just to sum up:
1. Transducers for language A and language B as "make-dependencies";
2. Mono-dictionaries in apertium-langA and apertium-langB as
"development-dependencies" + some trimming / duplicating /
keeping-up-to-date scripts.
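
To make the "trimming" in solution 2 concrete, here is a minimal sketch
in Python. It is not the script used in apertium-sme-nob; the function
names (lemma_of, bidix_lemmas, trim_lexc) and the sample entries are
made up for illustration. It assumes one lexc entry per line, in the
form "lemma:stem ContinuationClass ;", and it keeps every line it does
not recognise as an entry, so real lexc files would need more care.

```python
import re
import xml.etree.ElementTree as ET

def lemma_of(node):
    """Rebuild the lemma of a bidix <l>/<r> element; <b/> stands for a space."""
    parts = [node.text or ""]
    for child in node:
        if child.tag == "s":      # the first tag symbol ends the lemma
            break
        if child.tag == "b":
            parts.append(" ")
        parts.append(child.tail or "")
    return "".join(parts).strip()

def bidix_lemmas(bidix_xml, side="l"):
    """Collect the lemmas found on one side ('l' or 'r') of a bidix."""
    root = ET.fromstring(bidix_xml)
    return {lemma_of(node) for node in root.iter(side)}

# Naive entry pattern (an assumption): "lemma[:stem] ContClass ;"
ENTRY = re.compile(r"^\s*(?P<lemma>[^\s:;!%]+):?\S*\s+[^\s;]+\s*;")

def trim_lexc(lexc_text, keep):
    """Drop entry lines whose lemma is not in `keep`; keep all other lines."""
    out = []
    for line in lexc_text.splitlines():
        m = ENTRY.match(line)
        if m is None or m.group("lemma") in keep:
            out.append(line)
    return "\n".join(out) + "\n"

bidix = ('<dictionary><section id="main" type="standard">'
         '<e><p><l>алма<s n="n"/></l><r>алма<s n="n"/></r></p></e>'
         '</section></dictionary>')
lexc = ("LEXICON Root\nNouns ;\n\nLEXICON Nouns\n"
        "алма:алма N1 ; ! apple\n"
        "кітап:кітап N1 ; ! book\n")

trimmed = trim_lexc(lexc, bidix_lemmas(bidix, "l"))
print(trimmed)
```

Here the trimmed file keeps the lexicon headers and the entry that has
a bidix counterpart, and drops the entry that does not. Lines that the
pattern cannot parse (for example multiword lemmas with escaped spaces)
are left in rather than removed, which errs on the side of a larger
transducer.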

= Strengths and weaknesses of each solution =
Strengths and weaknesses become clear when we 'do' need to add
language-pair-specific stuff to mono-dictionaries.

All the examples that come to mind are for Russian-Tatar (i.e.
unrelated languages), so for related languages this might not be
relevant. Maybe they won't need any pair-specific stuff in their
mono-dictionaries at all, but this sounds too good to be true :)

Consider the Russian word "заговорить" ("start to talk"). It is
translated into Tatar with two words, just as into English. So in the
Russian-Tatar / Russian-English pair we will need to add "start to
talk" as a multiword.

I am sure that similar cases, where a word of languageA is translated
into languageB with a multiword, can be found for related languages too.

== 1. Make-dependencies ==
We can add such words to the monodictionaries in apertium-langA,
separating them into sublexicons or marking them with comments like
"this stuff is needed for the langA-langB pair".
But this way the transducer will become noisier and noisier.

== 2. Mono-dictionaries in apertium-langA and apertium-langB as
"development-dependencies" + some trimming / duplicating /
keeping-up-to-date scripts ==
In this case the monodictionaries in apertium-langX are considered to
be something like "vanilla software". They are kept close to the
linguistic traditions of POS-tagging etc. And they serve as a base for
building new pairs involving these languages.

Modifying them for a given pair is like patching the vanilla software.
A script could keep these modified versions in apertium-langX-langY
up-to-date with the mono-dictionaries in apertium-langX and
apertium-langY.

A challenge here is not to overwrite modifications while updating.
Although the script used in sme-nob solves the problem of updating, as
I understand it, it will overwrite any modifications made in
apertium-sme-nob. And I am not sure whether updating without
overwriting can be done at all technically.
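
One conceivable way around the overwriting problem is to never edit the
generated file at all, and to keep the pair-specific entries in a
separate overlay that is re-applied on every update. The sketch below
only illustrates that idea; it is not how the sme-nob script works, and
the function name apply_overlay, the lexicon names and the placeholder
entries are all invented here. It assumes lexicon headers are written
as "LEXICON Name" on a line of their own.

```python
def apply_overlay(base_text, overlay):
    """Insert pair-specific entries right after their LEXICON header.

    `overlay` maps a lexicon name to a list of entry lines kept in the
    pair's own repository; `base_text` is the (trimmed) vanilla lexc.
    """
    out = []
    seen = set()
    for line in base_text.splitlines():
        out.append(line)
        parts = line.split()
        if len(parts) == 2 and parts[0] == "LEXICON" and parts[1] in overlay:
            out.extend(overlay[parts[1]])
            seen.add(parts[1])
    missing = set(overlay) - seen
    if missing:
        # Fail loudly so pair-specific entries never vanish silently.
        raise KeyError("lexicons not found in base: %s" % sorted(missing))
    return "\n".join(out) + "\n"

# Placeholder entries, not real dictionary content.
overlay = {"Verbs": ["pair-only:pair-only V1 ; ! needed for langX-langY"]}

base_v1 = "LEXICON Root\nVerbs ;\n\nLEXICON Verbs\nold:old V1 ;\n"
base_v2 = base_v1 + "new:new V1 ;\n"   # the vanilla lexc after an update

pair_v1 = apply_overlay(base_v1, overlay)
pair_v2 = apply_overlay(base_v2, overlay)
print(pair_v2)
```

Because the pair's file is always rebuilt from the vanilla base plus the
overlay, an update of the base brings in the new entry while the
pair-specific entry survives. The price is that modifications can only
be additions; changing or removing a vanilla entry for one pair would
need something closer to a real patch or three-way merge.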

The second approach seems better than avoiding copying dictionaries
like the plague, especially in the long term, if the language gets
paired with a not-so-related language.

Best,

Ilnar Salimzyan
