Folks:

The elders in Apertium will not be surprised to hear me voice my opposition to changing the formats used between the different modules of the pipeline. In any case, this affects the core functionality of Apertium in many ways, and the need for it should be justified incontestably, so that the PMC can decide to have a new version of Apertium, which should inevitably provide paths to backward compatibility so that legacy languages and language pairs work identically and without any loss of performance. I believe we are far from "incontestability", but that is just my personal opinion.

Currently, modes are linear pipelines. Any functionality requiring information that is currently ingested by one module and not passed ahead could be sent to later modules by teeing (https://en.wikipedia.org/wiki/Tee_(command)), named pipes, etc. We would have a directed acyclic graph, much as in tools like make, snakemake, dgsh (https://github.com/dspinellis/dgsh/wiki) etc.
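
To make the teeing idea concrete, here is a toy sketch in Python. It is not Apertium code: plain functions stand in for pipeline modules, and every name in it (analyse, strip_surface, translate, generate, run) is hypothetical. It only shows the data flow, in which a copy of an early module's output is kept aside and joined again at a later module, so the pipeline becomes a small directed acyclic graph while the main stream stays unchanged.

```python
# Toy illustration only: functions stand in for pipeline modules.
# All names are hypothetical; none of this is Apertium's API.

def analyse(text):
    """Stand-in for an analyser: returns (surface, lemma) pairs."""
    return [(word, word.lower()) for word in text.split()]


def strip_surface(units):
    """Mimics the current linear stream: only the lemma is passed ahead."""
    return [lemma for _surface, lemma in units]


def translate(lemmas, bidix):
    """Stand-in for lexical transfer; None marks an unknown lemma."""
    return [bidix.get(lemma) for lemma in lemmas]


def generate(translations, surfaces):
    """Join point of the graph: unknown words fall back to the teed surface form."""
    return " ".join(
        t if t is not None else s for t, s in zip(translations, surfaces)
    )


def run(text, bidix):
    units = analyse(text)
    # The "tee": a copy of the surface forms bypasses the middle modules.
    surfaces = [surface for surface, _lemma in units]
    main_branch = translate(strip_surface(units), bidix)
    return generate(main_branch, surfaces)
```

With real modules the side branch would be a named pipe or a file written by tee rather than a Python list, but the shape of the graph would be the same.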

Any justification should first demonstrate, by working around the current format and modules, that the functionality is robust and needed, and it should be presented at a level of formality comparable to that currently used in our documentation.

Having said that, no one can oppose people forking and testing. If the new thing works, Apertium could bless the fork and merge it (depending on how the fork handles provisions for legacy Apertium workflows). As I said, though, this seems premature to me, but then I am usually very conservative.

Cheers

Mikel


On 29/3/20 at 11:41, Tanmai Khanna wrote:
Hey guys,
Here's a draft proposal for this project: http://wiki.apertium.org/wiki/User:Khannatanmai/GSoC2020Proposal_Trimming It's almost done except the work plan. Could you check it out? Any comments will be appreciated :)

Thanks,
Tanmai

On Sun, Mar 29, 2020 at 12:52 PM Tanmai Khanna <khanna.tan...@gmail.com> wrote:

    Hi Hèctor,
    A fundamental motivation for this proposal is the possibility of
    giving the power to each program to use and propagate as much
    information as it needs in the pipeline. In our discussion on the
    IRC, Tino Didriksen said:

        You should see how much secondary information VISL's streams
        have. Noun semantics, verb frames, dependency, markup tags,
        etc. Being able to carry any information along makes many
        things possible, often things you can't imagine because of
        current limitations.


    The example in my original message was, well, just an example, but
    the idea is that you can add as much information as you want, in
    the language models or even the translation modules. It's also not
    just about English words, but about all languages. This is not to
    say that we have to add case information to every English word. It
    is optional information which can be added if needed for the
    translation task.

    With this proposal we're trying to prepare the apertium stream for
    the future. Today we realised that we need the surface form in the
    stream, and tomorrow we might need semantic tags, sentiment tags,
    etc. *If we don't do this now, we will have to modify all the
    parsers in the pipeline each time we need more information in the
    pipe.* This is why it's a good idea to modify the parsers so that
    they can handle an arbitrary amount of information.

    Lastly, one point we should discuss is this idea about how any
    secondary information I add in the monodix would be available for
    everyone who uses that information. There are several things to say
    about this:

      * As long as the information is correct, I don't really see why
        redundant secondary information should bother anyone. It will
        be available for anyone who wishes to use it for their task,
        and if you don't want to use it the programs will ignore it.
      * Another idea is that secondary information could be put in a
        separate dix, however this would lead to an unnecessary
        increase in complexity.

    Unless the developers of Apertium feel that redundant
    information in the stream will be a huge problem, this will allow
    each program to access a lot more information and open up
    possibilities that we haven't even thought of yet. At the very
    least, it will help us to eliminate trimming.

    Thanks and Regards,
    Tanmai Khanna

    On Sun, Mar 29, 2020 at 10:39 AM Hèctor Alòs i Font
    <hectora...@gmail.com> wrote:

        Hi Tanmai,

        I am surprised by this proposal. It involves some very
        important changes that should be better justified. I don't
        quite understand when one should define the "optional
        secondary information" in addition to the current
        morphological fields. Will it be in the language module
        (apertium-xxx) or in each of the translation modules
        (apertium-xxx-yyy)? Part of the problem may be in the example.
        I can't imagine why information on case should be added to
        every English word (any more than, say, information about
        possession, which is common in Turkic languages). Will this
        kind of information, unnecessary for everybody or almost
        everybody, be found in every language pair using, say,
        English, if someone wants to add it for his or her specific
        purposes? As far as I understand, the given project needs the
        surface form of the word to be added. This seems quite
        logical. Moreover, this information may be useful for e.g.
        lexical selection and structural transfer. But anything
        beyond that seems too obscure to me.

        Best,
        Hèctor

        Message from Tanmai Khanna <khanna.tan...@gmail.com>
        on Sat., 28 March 2020 at 23:51:

            Hey guys,
            As part of the project to eliminate trimming, I had to
            come up with a way to include the surface form in the
            lexical unit, and hence to modify the apertium stream
            format. To do this I would have to modify the parsers of
            every program in the pipeline, and if that has to happen,
            we discussed on the IRC that *it might be a good idea to
            modify the stream in such a way that we can include an
            arbitrary amount of information in a lexical unit, and
            each program can use whatever information they need.*

            The current information in the lexical unit would be
            primary information, and then we would have optional
            secondary information which could contain the surface
            form, but also literally anything you can think of (case,
            sentiment, pragmatic info, etc.). This would open up a lot
            of possibilities for each program, and it would
            strengthen the apertium stream format considerably.

            We discussed several possible syntaxes for this new stream
            format, and the one that seems the best is something like
            this:
            
^potato<n><pl><case:aa><sf:potatoes><other-prefix:other-value>/patata<n><f><pl><more:other>$

            This doesn't mess with the current stream format too much.
            The number of tags is already arbitrary so that helps. The
            secondary tags contain a ":" that would help distinguish
            them from primary tags.
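
To illustrate the ":" convention described above, here is a minimal sketch of how a parser might separate primary tags from secondary ones. It is only an illustration under stated assumptions: the function names parse_reading and parse_lexical_unit are hypothetical, the code is Python rather than the language of the actual pipeline tools, and it ignores escaped characters, which a real stream parser would have to handle.

```python
import re

# Hypothetical helper names; not part of any Apertium tool.
# Assumption: no escaped '<', '>', '/', '^' or '$' inside the unit.
TAG_RE = re.compile(r"<([^<>]+)>")


def parse_reading(reading):
    """Split one reading into (lemma, primary tags, secondary info)."""
    lemma = reading.split("<", 1)[0]
    primary, secondary = [], {}
    for tag in TAG_RE.findall(reading):
        if ":" in tag:
            # Secondary tag: text before the first ':' is the prefix.
            key, value = tag.split(":", 1)
            secondary[key] = value
        else:
            primary.append(tag)
    return lemma, primary, secondary


def parse_lexical_unit(unit):
    """Parse '^reading/reading$' into a list of parsed readings."""
    body = unit.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("not a lexical unit: %r" % unit)
    return [parse_reading(r) for r in body[1:-1].split("/")]
```

A program that does not care about secondary information would use only the lemma and the primary tag list, and could pass the secondary dictionary along untouched.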

            To implement this a modification would still be needed to
            all the parsers but the benefits far outweigh the amount
            of work needed to pull this off.

            Since this would be a major fundamental change to
            Apertium, I request you all to contribute with your views,
            any pros, cons, suggestions - to the idea, to the syntax,
            anything.

            Thanks and Regards,
            Tanmai Khanna
-- *Khanna, Tanmai*
            _______________________________________________
            Apertium-stuff mailing list
            Apertium-stuff@lists.sourceforge.net
            <mailto:Apertium-stuff@lists.sourceforge.net>
            https://lists.sourceforge.net/lists/listinfo/apertium-stuff


--
Mikel L. Forcada  http://www.dlsi.ua.es/~mlf/
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03690 Sant Vicent del Raspeig
Spain
Office: +34 96 590 9776
