Re: [Apertium-stuff] How useful is eliminating trimming for language developers?

Jonathan Washington Tue, 26 May 2020 08:35:02 -0700

On Tue, May 26, 2020, 08:48 Francis Tyers <fty...@prompsit.com> wrote:


> On 2020-05-26 12:27, Kevin Brubeck Unhammer wrote:
> > Xavi Ivars <xavi.iv...@gmail.com> wrote:
> >
> >> * In trimming disadvantage number 1, we're stating that we're OK
> >> having crappy monodixes because we *fix* that later on with trimming.
> >> I'm sure that's where we are now, but as a project that focuses a lot
> >> on providing free (as in speech) language resources that are later
> >> used for many other use cases, I don't feel comfortable with that
> >> status. I think we should aim to have dictionaries that are as correct
> >> as possible. And if we did that, disadvantage number 1 would be
> >> smaller (even if not disappearing completely).
> >
> > This point seems like a distraction. No one puts errors in monodix on
> > purpose. We do fix errors in monodix (when we find them, and have
> > time). When we use monodix for tasks other than MT, we find and fix
> > even more. On the other hand, there's no point in manually going
> > through every monodix and bloody well searching for errors because
> > there may be some that may show up if you stop trimming – please spend
> > your time on something more useful.
> >
> > But there may also be some confusion as to what is an error. There may
> > be things in monodixes that don't belong in "regular" dictionaries, but
> > do belong in monodix – because the goal is building MT systems, not
> > Dictionaries.
> >
> > And if your monodix is to be used for other things than MT, you're just
> > gonna get many more such "weird" entries that all other use-cases need
> > to filter out. E.g. Giellatekno's Northern Saami analyser (used for MT,
> > spelling, grammar check etc.) contains several non-normative analyses,
> > "multiwords" and unusual taggings just for the grammar checker. These
> > are not included in the FSTs built for other use-cases, but are trimmed
> > out, mostly using tags (but also bidix, in the case of MT).
> >
>
> A better way of doing this kind of "lexicographic" work would be useful.
> In .lexc-based analysers we mostly use comments, but they are very ad hoc.
> Some examples:
>
> ! Use/MT            - Only use this in MT systems
> ! Src/Bible         - This word came from the Bible
> ! Err/Orth          - Orthographic error
> ! Dial/North        - Northern variant
> ! Use/kaz-kir       - Only use this in kaz-kir
> ! Use/Circ          - This causes a cycle
> ! Dir/LR            - Only analysis
> ! Dir/RL            - Only generation
> ! Use/MWE           - Multiword
> ! Der/Caus          - Derived form by causative
> ! Use/Arch          - Archaic form
>
> Fran
>

Another problem with these comments is that we don't use most of them for
anything.

In particular, lines marked Use/MT should be stripped out to produce vanilla
transducers, but I don't think we've ever done that.
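
For what it's worth, here is a minimal sketch of what such a stripping step
might look like. It is not something that exists in any of our build
systems. It assumes one entry per line, that the tags sit in a trailing
"!" comment in the Category/Value shape Fran lists above, and that "%!" is
an escaped literal rather than a comment start. The function names and the
sample entries are made up for illustration.

```python
import re

# Assumption: a comment starts at the first "!" not escaped as "%!".
COMMENT_RE = re.compile(r"(?<!%)!(.*)$")
# Assumption: tags look like Category/Value, using the categories listed above.
TAG_RE = re.compile(r"\b(?:Use|Src|Err|Dial|Dir|Der)/[\w-]+")


def comment_tags(line):
    """Return the set of Category/Value tags found in a line's '!' comment."""
    match = COMMENT_RE.search(line)
    return set(TAG_RE.findall(match.group(1))) if match else set()


def strip_tagged(lines, drop=("Use/MT",)):
    """Keep only the lines whose comment carries none of the tags in `drop`."""
    unwanted = set(drop)
    return [line for line in lines if not (comment_tags(line) & unwanted)]


# Hypothetical lexc-style entries, invented for this example.
sample = [
    "alma:alma N1 ; ! apple",
    "foo:foo N1 ; ! Use/MT",
    "bar:bar N1 ; ! Dir/LR Use/Arch",
]

# The "vanilla" list drops only the Use/MT entry.
vanilla = strip_tagged(sample)
```

Matching on the tags rather than on the raw comment text would let the same
filter drop, say, Err/Orth or Use/Arch entries by passing a different `drop`
tuple. A real version would need to cope with multi-line entries and the
other lexc escaping rules, which this sketch ignores.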

This isn't a problem inherent to the methodology—just our inability to get
organised enough to use it for everything we dreamt it might be useful for.

--
Jonathan



>
> _______________________________________________
> Apertium-stuff mailing list
> Apertium-stuff@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>

