Re: [Apertium-stuff] How useful is eliminating trimming for language developers?

Jonathan Washington Tue, 26 May 2020 09:29:25 -0700

Hi all,

After reading through this thread and giving it some thought, I have some
responses.

First of all, I don't care what the "default" is (i.e., whatever
apertium-init creates without flags), as long as there remains choice.  A
lot of pairs already have things set up in different ways, and I see no
problem with allowing for more variation.  As long as everything is
backwards-compatible and no pair is affected by these changes unless it
opts in, everything is fine.  One way to keep things this way is to
provide a module to allow the injection of secondary tags from surface
forms and superblanks *after* analysis, and keep secondary tag code out of
the transducer processors.

I believe Daniel's proposal for apertium-separable trimming allows for
another nice compromise.  I was skeptical of this as it was being discussed
on IRC, but Daniel's explanation in this thread clarified things (I often
engage with IRC these days while dealing with small children, and can't
necessarily follow everything as closely as I might like to...).

The one difficulty with this approach is that MWEs really do need to be
offloaded to lsx files, and those really do then need to be part of
language modules, not translation pairs.  Lsx dictionaries being part of
language modules is something I've wondered about from the start, but the
choice of which MWEs should be included is pair-specific, so it was decided
they should be part of translation pairs.  If we trim against the bidix the
same way we've been trimming the monodix (and forgo trimming the monodix),
then I think we might be able to have our cake and eat it too.

In short, it allows us to control the MWEs we use for a given language pair
by simply having them in the bidix.  With weighting of the monodix against
the bidix and not trimming the monodix, we can also have forms not in the
bidix still benefit various stages of translation without "wrong" analyses
(more often, analyses that are useful for some purposes but not for a given
translation pair) interfering with tokenisation.  We just have to offload
all MWEs (most entries with spaces) from the monodix to the lsx file.

Along those lines, I'm ready to implement this for the Kazakh-Kyrgyz pair
(which is at a "staging" level of development).  What will need to be done:

- Disable trimming of monodixes,
- Enable weighting of monodixes against bidix,
- Move lsx files to respective monolingual modules,
- Merge apertium-eng-kir's kir.lsx file into the Kyrgyz monolingual lsx
file (and do remaining steps for eng-kir too),
- Move all (or most) MWEs from monolingual modules to monolingual lsx
files.  Probably for now add a comment like "Use/MWE" to the "moved"
entries in the monolingual dictionary and grep those lines out at compile
time.  This is important at least for the Kazakh transducer, which is used
in two released pairs and a number of other developed tools.
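
The "grep those lines out at compile time" step in the last item could be
sketched roughly as follows.  This is only an illustration: the marker text
"Use/MWE" is the one suggested above, but the function name and the sample
entries are made up, and it assumes one dictionary entry per line.

```python
MARKER = "Use/MWE"  # comment text proposed above; not an existing convention


def strip_moved_entries(lines, marker=MARKER):
    """Return only the lines that do not carry the marker (like `grep -v`)."""
    return [line for line in lines if marker not in line]


# Hypothetical, simplified dictionary lines (not real apertium-kaz entries).
sample = [
    'word1:word1 N1 ; ! "gloss"',
    'multi% word:multi% word N1 ; ! "gloss" ! Use/MWE',
    'word2:word2 V-TV ; ! "gloss"',
]

kept = strip_moved_entries(sample)
print("\n".join(kept))
```

Tools that rely on the full transducer would compile from the unfiltered
file, so the marked entries stay available to them.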

One challenge I can see with automating the moving of "MWEs" (defined here
as open-category words that have spaces) is that because of the nature of
MWEs, a number of them in any given language have elements that aren't used
elsewhere in the language, and so won't otherwise receive analyses.  My
current understanding is that if there's not an obvious way (or need) to
handle these in lsx, there should be no problem with leaving them in the
monolingual dictionaries.  These are not the forms that could cause "take
precautions" problems anyway.
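
To make that split concrete, here is a rough sketch of sorting MWE lemmas
into ones that look safe to move to lsx and ones to leave in the monolingual
dictionary.  It is a crude lemma-level approximation under my own
assumptions: the function name and the toy lemmas are hypothetical, and a
real check would have to run each component through the analyser rather than
compare lemma strings.

```python
def split_mwes(lemmas):
    """Partition lemmas containing spaces into (movable, keep_in_monodix).

    An MWE counts as movable only if every space-separated part also
    occurs as a single-word lemma in the same list.
    """
    single_words = {lemma for lemma in lemmas if " " not in lemma}
    movable, keep = [], []
    for lemma in lemmas:
        if " " not in lemma:
            continue  # not an MWE under the definition above
        parts = lemma.split()
        if all(part in single_words for part in parts):
            movable.append(lemma)
        else:
            keep.append(lemma)
    return movable, keep


# Toy English lemmas for illustration only.
lemmas = ["take", "precautions", "take precautions", "and", "kin", "kith and kin"]
movable, keep = split_mwes(lemmas)
print("move to lsx:", movable)
print("leave in monodix:", keep)
```

Here "kith and kin" stays in the monolingual dictionary because "kith" has
no entry of its own.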

After making this change, things would be set up to leverage secondary tags
as soon as they become available.

--
Jonathan

On Tue, May 26, 2020, 07:27 Kevin Brubeck Unhammer <unham...@fsfe.org>
wrote:

> Xavi Ivars <xavi.iv...@gmail.com> wrote:
>
> > * In the trimming disadvantages number 1, we're stating that we're OK
> > having crappy monodixes because we *fix* that later on with trimming. I'm
> > sure that's where we are now, but as a project that focuses a lot on
> > providing free (as in speech) language resources that are later used for
> > many other use cases, I don't feel comfortable with that status. I think
> > we should aim to have as correct as possible dictionaries. And if we did
> > that, disadvantage number 1 would be smaller (even if not disappearing
> > completely).
>
> This point seems like a distraction. No one puts errors in monodix on
> purpose. We do fix errors in monodix (when we find them, and have
> time). When we use monodix for other tasks than MT, we find and fix even
> more. On the other hand, there's no point in manually going through
> every monodix and bloody well searching for errors because there may be
> some that may show up if you stop trimming – please spend your time on
> something more useful.
>
> But there may also be some confusion as to what is an error. There may
> be things in monodixes that don't belong in "regular" dictionaries, but
> do belong in monodix – because the goal is building MT systems, not
> Dictionaries.
>
> And if your monodix is to be used for other things than MT, you're just
> gonna get many more such "weird" entries that all other use-cases need
> to filter out. E.g. Giellatekno's Northern Saami analyser (used for MT,
> spelling, grammar check etc.) contains several non-normative analyses,
> "multiwords" and unusual taggings just for the grammar checker. These
> are not included in the FSTs built for other use-cases, but are trimmed
> out, mostly using tags (but also bidix, in the case of MT).
> _______________________________________________
> Apertium-stuff mailing list
> Apertium-stuff@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>

