> * In the trimming disadvantages number 1, we're stating that we're OK
> having crappy monodixes because we *fix* that later on with trimming.
> I'm sure that's where we are now, but as a project that focuses a lot on
> provided free (as in speech) language resources that are later used for
> many other use cases, I don't feel comfortable with that status. I think
> we should aim to have as correct as possible dictionaries. And if we did
> that, disadvantage number 1 would be smaller (even if not disappearing
> completely).
> I think the only argument here is that we want to keep having bad stuff
> in monodixes. From software engineering standpoint I find this argument
> really problematic, to hinder further development of systems because we
> want to keep bad, low quality data in monodixes is not good. As I've
> curated and maintained a bunch of stuff though, I can relate to the
> sentiment, linguistic data collection including dictionaries is not
> really a software project that will have complete and correct version
> 1.0. But I do think apertium does need to move towards maybe more
> quality control, more continuous testing for monodixes, especially of
> the esteemed release quality languages.

> This is critically important, in my opinion. Languages should be
> stand-alone and widely usable for many purposes. As I wrote on IRC, this a
> luxury problem. If the source analysis is bad, bloody well fix it so that
> all pairs, spell checker, and corpus work can take advantage. Don't let it
> remain a task for the pairs.

It's true that our tools should take into account the imperfectness of the
resources that they're working with, and if controlling monodixes through
the bidix were an acceptable design choice, I'd agree with it as well. But
from the several views expressed here, it's clear that either not all
monodixes are crappy or, if they are, we aren't necessarily okay with them
staying in that state. Apertium doesn't just offer translation services but
also morphological analysers for several languages, and the people who use
our morphologically analysed data may not expect erroneous analyses in it
just because we have found a way to check for them through the bidix.
Having said that, it's easier said than done, so I feel that even if this
doesn't lead to a change in existing pairs, it could give us a desired
direction for the future.
The MWE problem has been discussed, and it has been agreed that we could
either keep partial trimming for MWEs or gradually shift them to
-separable, which can trim them accordingly. As for the compile time
increase, as detailed in a previous mail, lt-trim alone takes ~4 seconds
and building a weighted dictionary takes ~8 seconds. I guess these facts
are enough for us to at least provide our language developers with the
option to disable trimming, and that is what I hope to make a reality by
propagating the surface form through secondary tags.

Thanks for all your comments. :)

Regards,
Tanmai Khanna

On Mon, May 25, 2020 at 11:02 PM Tanmai Khanna <khanna.tan...@gmail.com>
wrote:

> Here's a timing test for weighted dictionaries.
> On apertium-eng-kaz:
>
> 1.
> lt-trim analyser.bin bidix.bin analyser-found.bin
>
> Time:
> real 0m4.257s
> user 0m4.120s
> sys 0m0.131s
>
> 2.
> lt-trim analyser.bin bidix.bin analyser-found.bin
> lt-print -H analyser.bin > analyser.att
> lt-print -H analyser-found.bin > analyser-found.att
> hfst-txt2fst -e ε analyser.att -o analyser.hfst
> hfst-txt2fst -e ε analyser-found.att -o analyser-found.hfst
> hfst-subtract -1 analyser.hfst -2 analyser-found.hfst -o analyser-unfound.hfst
> hfst-reweight -a 1 analyser-unfound.hfst -o analyser-unfound.weighted.hfst
> hfst-union -1 analyser-unfound.weighted.hfst -2 analyser-found.hfst -o analyser.weighted.hfst
> hfst-fst2txt analyser.weighted.hfst -o analyser.weighted.att
> lt-comp lr analyser.weighted.att analyser.weighted.bin
>
> Time:
> real 0m7.990s
> user 0m7.227s
> sys 0m0.730s
>
> Tanmai
>
> On Mon, May 25, 2020 at 10:58 PM Samuel Sloniker <scoopgra...@gmail.com>
> wrote:
>
>> Maybe make trimming the default, but make apertium-init disable it for
>> new pairs?
>>
>> On Mon, May 25, 2020, 10:01 Tino Didriksen <m...@tinodidriksen.com>
>> wrote:
>>
>>> On Mon, 25 May 2020 at 12:29, Xavi Ivars <xavi.iv...@gmail.com> wrote:
>>>
>>>> * In the trimming disadvantages number 1, we're stating that we're OK
>>>> having crappy monodixes because we *fix* that later on with trimming.
>>>> I'm sure that's where we are now, but as a project that focuses a lot on
>>>> provided free (as in speech) language resources that are later used for
>>>> many other use cases, I don't feel comfortable with that status. I think we
>>>> should aim to have as correct as possible dictionaries. And if we did that,
>>>> disadvantage number 1 would be smaller (even if not disappearing
>>>> completely).
>>>>
>>>
>>> This is critically important, in my opinion. Languages should be
>>> stand-alone and widely usable for many purposes. As I wrote on IRC, this a
>>> luxury problem. If the source analysis is bad, bloody well fix it so that
>>> all pairs, spell checker, and corpus work can take advantage. Don't let it
>>> remain a task for the pairs.
>>>
>>> The fact that trimming via bidix and target monodix is currently needed
>>> is a historical accident. It should not be something developers rely on
>>> going forward, and especially not for new pairs.
>>>
>>> -- Tino Didriksen
>>> _______________________________________________
>>> Apertium-stuff mailing list
>>> Apertium-stuff@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>>>
>
>
> --
> *Khanna, Tanmai*
>

--
*Khanna, Tanmai*