> * In the trimming disadvantages number 1, we're stating that we're OK
> with having crappy monodixes because we *fix* that later on with trimming.
> I'm sure that's where we are now, but as a project that focuses a lot on
> providing free (as in speech) language resources that are later used for
> many other use cases, I don't feel comfortable with that status. I think
> we should aim to have dictionaries that are as correct as possible. And if
> we did that, disadvantage number 1 would be smaller (even if not
> disappearing completely).

> I think the only argument here is that we want to keep having bad stuff
> in monodixes. From a software engineering standpoint I find this argument
> really problematic: hindering further development of systems because we
> want to keep bad, low-quality data in monodixes is not good. As I've
> curated and maintained a bunch of stuff, though, I can relate to the
> sentiment: linguistic data collection, including dictionaries, is not
> really a software project that will have a complete and correct version
> 1.0. But I do think apertium needs to move towards more quality control
> and more continuous testing for monodixes, especially for the esteemed
> release-quality languages.

> This is critically important, in my opinion. Languages should be
> stand-alone and widely usable for many purposes. As I wrote on IRC, this is
> a luxury problem. If the source analysis is bad, bloody well fix it so that
> all pairs, spell checker, and corpus work can take advantage. Don't let it
> remain a task for the pairs.

It's true that our tools should take into account the imperfection of the
resources they work with, and if controlling monodixes through the bidix
were an acceptable design choice, I'd agree with it as well. But from the
views quoted above, it's clear that either not all monodixes are crappy, or
if they are, we aren't necessarily okay with them staying in that state.
Apertium doesn't just offer translation services but also morphological
analysers for several languages, and the people who use our morphologically
analysed data may not expect incorrect analyses in it just because we have
found a way to filter them out through the bidix. Having said that, it's
easier said than done, so even if this doesn't lead to a change in existing
pairs, I feel it could give us a desired direction for the future.

The MWE problem has been discussed, and it has been agreed that we could
either keep partial trimming for MWEs or gradually shift them to
-separable, which can trim them accordingly.

As for the compile time increase, as detailed in a previous mail, on
apertium-eng-kaz lt-trim alone takes ~4 seconds, and the full
weighted-dictionary pipeline (which includes lt-trim) takes ~8 seconds.
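
To put those two numbers side by side, here is a small sketch that converts
the `real` values from the quoted timing test into seconds and reports the
difference and the ratio. The helper name `parse_time` is only an
illustration and not part of any Apertium tool; the input strings are the
ones from the eng-kaz run quoted below.

```python
import re


def parse_time(stamp: str) -> float:
    """Convert bash `time` output such as '0m4.257s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time format: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# `real` times from the timing test on apertium-eng-kaz
trim_only = parse_time("0m4.257s")  # lt-trim alone
weighted = parse_time("0m7.990s")   # lt-trim plus the HFST weighting steps

extra = weighted - trim_only        # additional wall-clock time
ratio = weighted / trim_only        # relative slowdown

print(f"extra: {extra:.3f}s, ratio: {ratio:.2f}x")
```

So the weighting steps cost a little under 4 extra seconds on this pair,
which is slightly less than double the trim-only time. This is a single run
on one pair, so treat it as a rough indication only.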

I guess these facts are enough for us to at least provide our language
developers with the option to disable trimming, and that is what I hope to
make a reality by propagating the surface form through secondary tags.
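
For anyone who hasn't followed the weighted-dictionary idea, here is a toy
sketch of what the HFST steps in the quoted timing test do conceptually:
analyses that the bidix knows keep weight 0, all others get a penalty, and
nothing is thrown away. This works on plain Python sets instead of
transducers, and the function names and the example analyses are made up
for illustration, so it is not how lttoolbox or HFST implement it.

```python
def weight_analyses(all_analyses, found_in_bidix, penalty=1.0):
    """Give weight 0 to analyses the bidix knows and `penalty` to the rest."""
    found = set(all_analyses) & set(found_in_bidix)  # role of lt-trim
    unfound = set(all_analyses) - found              # role of hfst-subtract
    weighted = {a: penalty for a in unfound}         # role of hfst-reweight -a 1
    weighted.update({a: 0.0 for a in found})         # role of hfst-union
    return weighted


def best(weighted):
    """Return the analyses with the lowest weight, sorted for stable output."""
    lowest = min(weighted.values())
    return sorted(a for a, w in weighted.items() if w == lowest)


# Hypothetical analyses of the surface form "saw"
analyses = ["saw<n><sg>", "saw<vblex><pres>", "see<vblex><past>"]
in_bidix = ["see<vblex><past>"]

weights = weight_analyses(analyses, in_bidix)
print(best(weights))  # the bidix-known analysis is preferred
print(len(weights))   # but every analysis is still available
```

The point of the sketch is the contrast with trimming: a trimmed analyser
would contain only the bidix-known analysis, whereas the weighted one
prefers it and still keeps the others.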

Thanks for all your comments. :)

Regards,
Tanmai Khanna


On Mon, May 25, 2020 at 11:02 PM Tanmai Khanna <khanna.tan...@gmail.com>
wrote:

> Here's a timing test for weighted dictionaries.
> On apertium-eng-kaz:
>
> 1.
>
> lt-trim analyser.bin bidix.bin analyser-found.bin
>
> Time:
>
> real 0m4.257s
> user 0m4.120s
> sys  0m0.131s
>
> 2.
>
> lt-trim analyser.bin bidix.bin analyser-found.bin
> lt-print -H analyser.bin > analyser.att
> lt-print -H analyser-found.bin > analyser-found.att
> hfst-txt2fst -e ε analyser.att -o analyser.hfst
> hfst-txt2fst -e ε analyser-found.att -o analyser-found.hfst
> hfst-subtract -1 analyser.hfst -2 analyser-found.hfst -o analyser-unfound.hfst
> hfst-reweight -a 1 analyser-unfound.hfst -o analyser-unfound.weighted.hfst
> hfst-union -1 analyser-unfound.weighted.hfst -2 analyser-found.hfst -o analyser.weighted.hfst
> hfst-fst2txt analyser.weighted.hfst -o analyser.weighted.att
> lt-comp lr analyser.weighted.att analyser.weighted.bin
>
> Time:
>
> real 0m7.990s
> user 0m7.227s
> sys  0m0.730s
>
>
> Tanmai
>
> On Mon, May 25, 2020 at 10:58 PM Samuel Sloniker <scoopgra...@gmail.com>
> wrote:
>
>> Maybe make trimming the default, but make apertium-init disable it for
>> new pairs?
>>
>> On Mon, May 25, 2020, 10:01 Tino Didriksen <m...@tinodidriksen.com>
>> wrote:
>>
>>> On Mon, 25 May 2020 at 12:29, Xavi Ivars <xavi.iv...@gmail.com> wrote:
>>>
>>>> * In the trimming disadvantages number 1, we're stating that we're OK
>>>> with having crappy monodixes because we *fix* that later on with
>>>> trimming. I'm sure that's where we are now, but as a project that focuses
>>>> a lot on providing free (as in speech) language resources that are later
>>>> used for many other use cases, I don't feel comfortable with that status.
>>>> I think we should aim to have dictionaries that are as correct as
>>>> possible. And if we did that, disadvantage number 1 would be smaller
>>>> (even if not disappearing completely).
>>>>
>>>
>>> This is critically important, in my opinion. Languages should be
>>> stand-alone and widely usable for many purposes. As I wrote on IRC, this is a
>>> luxury problem. If the source analysis is bad, bloody well fix it so that
>>> all pairs, spell checker, and corpus work can take advantage. Don't let it
>>> remain a task for the pairs.
>>>
>>> The fact that trimming via bidix and target monodix is currently needed
>>> is a historical accident. It should not be something developers rely on
>>> going forward, and especially not for new pairs.
>>>
>>> -- Tino Didriksen
>>> _______________________________________________
>>> Apertium-stuff mailing list
>>> Apertium-stuff@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>>>
>>
>
>
> --
> *Khanna, Tanmai*
>


-- 
*Khanna, Tanmai*
