On Tue, May 26, 2020, 08:48 Francis Tyers <fty...@prompsit.com> wrote:
> On 2020-05-26 12:27, Kevin Brubeck Unhammer wrote:
> > Xavi Ivars <xavi.iv...@gmail.com> wrote:
> >
> >> * In the trimming disadvantages number 1, we're stating that we're OK
> >> having crappy monodixes because we *fix* that later on with trimming.
> >> I'm sure that's where we are now, but as a project that focuses a lot
> >> on providing free (as in speech) language resources that are later
> >> used for many other use cases, I don't feel comfortable with that
> >> status. I think we should aim to have as correct as possible
> >> dictionaries. And if we did that, disadvantage number 1 would be
> >> smaller (even if not disappearing completely).
> >
> > This point seems like a distraction. No one puts errors in monodix on
> > purpose. We do fix errors in monodix (when we find them, and have
> > time). When we use monodix for other tasks than MT, we find and fix
> > even more. On the other hand, there's no point in manually going
> > through every monodix and bloody well searching for errors because
> > there may be some that may show up if you stop trimming – please spend
> > your time on something more useful.
> >
> > But there may also be some confusion as to what is an error. There may
> > be things in monodixes that don't belong in "regular" dictionaries, but
> > do belong in monodix – because the goal is building MT systems, not
> > Dictionaries.
> >
> > And if your monodix is to be used for other things than MT, you're just
> > gonna get many more such "weird" entries that all other use-cases need
> > to filter out. E.g. Giellatekno's Northern Saami analyser (used for MT,
> > spelling, grammar check etc.) contains several non-normative analyses,
> > "multiwords" and unusual taggings just for the grammar checker. These
> > are not included in the FSTs built for other use-cases, but are trimmed
> > out, mostly using tags (but also bidix, in the case of MT).
> >
>
> A better way of doing this kind of "lexicographic" work would be useful.
> In .lexc-based analysers we mostly use comments, but they are very ad hoc.
> Some examples:
>
> ! Use/MT - Only use this in MT systems
> ! Src/Bible - This word came from the Bible
> ! Err/Orth - Orthographic error
> ! Dial/North - Northern variant
> ! Use/kaz-kir - Only use this in kaz-kir
> ! Use/Circ - This causes a cycle
> ! Dir/LR - Only analysis
> ! Dir/RL - Only generation
> ! Use/MWE - Multiword
> ! Der/Caus - Derived form by causative
> ! Use/Arch - Archaic form
>
> Fran

Another problem with these comments is that we don't use most of them for
anything. In particular, lines marked Use/MT should be stripped out to
produce vanilla transducers, but I don't think we've ever done that. This
isn't a problem inherent to the methodology, just our inability to get
organised enough to use it for everything we dreamt it might be useful for.

--
Jonathan
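
For what it's worth, the stripping step mentioned above could be sketched
roughly as follows. This is only an illustration, not an existing Apertium
tool: it assumes each tag sits in the `!` comment on the same line as the
entry it marks, the function names (`split_comment`, `strip_tagged`) are
hypothetical, and the sample entries and the continuation class `N1` are
made up.

```python
import re

# Matches tags such as Use/MT, Dir/LR or Use/kaz-kir inside a comment.
TAG_RE = re.compile(r"\b(?:Use|Src|Err|Dial|Dir|Der)/[\w-]+")


def split_comment(line):
    """Split a lexc line into (code, comment) at the first unescaped '!'."""
    i = 0
    while i < len(line):
        if line[i] == "%":  # '%' escapes the next character in lexc
            i += 2
            continue
        if line[i] == "!":
            return line[:i], line[i + 1:]
        i += 1
    return line, ""


def strip_tagged(lines, unwanted=frozenset({"Use/MT"})):
    """Drop entry lines whose comment carries any of the unwanted tags.

    Comment-only lines (e.g. a legend explaining the tags) are kept.
    """
    kept = []
    for line in lines:
        code, comment = split_comment(line)
        tags = set(TAG_RE.findall(comment))
        if code.strip() and tags & unwanted:
            continue  # tagged entry: leave it out of the vanilla build
        kept.append(line)
    return kept


# Invented sample input, for illustration only.
sample = [
    "! Use/MT - Only use this in MT systems",
    "alma:alma N1 ; ! Use/MT",
    "kitap:kitap N1 ;",
    "ata:ata N1 ; ! Dial/North Use/Arch",
]
vanilla = strip_tagged(sample)
```

Passing a different `unwanted` set (say `{"Use/MT", "Use/Arch"}`) would drop
other classes of entry in the same pass. Entries whose tag comment sits on a
separate line would need a different approach.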
_______________________________________________ Apertium-stuff mailing list Apertium-stuff@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/apertium-stuff