Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

Mikel L. Forcada Sun, 29 Mar 2020 03:22:14 -0700

Folks:

The elders in Apertium will not be surprised if I voice my opposition to changing the formats used between the different modules of the Apertium pipeline. In any case, this affects the core functionality of Apertium in many ways, and its need should be justified in an uncontestable way, so that the PMC can decide to have a new version of Apertium, which should inevitably have paths to backward compatibility so that legacy languages and language pairs work identically and without any loss of performance. I believe we are far from "uncontestability", but that is just my personal opinion.

Currently, modes are linear pipelines. Any information that is currently ingested by one module and not passed ahead, but that some new functionality requires, could be sent to later modules by teeing (https://en.wikipedia.org/wiki/Tee_(command)), named pipes, etc. We would then have a directed acyclic graph, much as in tools like make, snakemake, dgsh (https://github.com/dspinellis/dgsh/wiki), etc.
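To make the teeing idea concrete, here is a minimal sketch in Python. It is not how Apertium modes are implemented; the two stages, their tiny dictionaries and the "@" and "*" markers are hypothetical stand-ins. It only illustrates how one branch can continue down the linear pipeline while another carries the surface forms around a module that drops them. It assumes the later module emits exactly one output per input, which a real transfer stage need not do.

```python
from itertools import tee

def analyse(words):
    # Hypothetical analyser: yields (surface form, lexical unit) pairs.
    lexicon = {"potatoes": "potato<n><pl>", "dogs": "dog<n><pl>"}
    for w in words:
        yield w, lexicon.get(w, "*" + w)

def transfer(units):
    # Hypothetical later module that only ever sees lexical units.
    # Units missing from its dictionary are passed through with a marker.
    bidix = {"potato<n><pl>": "patata<n><f><pl>"}
    for u in units:
        yield bidix.get(u, "@" + u)

def run(words):
    # Tee the analyser output: one branch continues down the linear
    # pipeline, the other carries the surface forms around it.
    main, side = tee(analyse(words))
    translated = transfer(lu for _, lu in main)
    surfaces = (sf for sf, _ in side)
    # A later module joins the two branches again, position by position.
    return list(zip(surfaces, translated))
```

The stream format between the stages is untouched here; the cost is that whatever joins the branches has to keep them aligned.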

Any justification should first prove that the functionality is robust and needed by working around the current format and modules, and be presented at a level of formality comparable to that currently used in our documentation.

Having said that, no one can oppose people forking and testing. If the new thing works, Apertium could bless the fork and merge it (depending on how the fork handles provisions for legacy Apertium workflows). But, as I said, this seems premature to me. Then again, I am usually very conservative.


Cheers

Mikel


On 29/3/20 at 11:41, Tanmai Khanna wrote:

Hey guys,

Here's a draft proposal <http://wiki.apertium.org/wiki/User:Khannatanmai/GSoC2020Proposal_Trimming> for this project. It's almost done except the work plan. Any comments will be appreciated :)


Thanks,
Tanmai

On Sun, Mar 29, 2020 at 12:52 PM Tanmai Khanna <khanna.tan...@gmail.com> wrote:

Hi Hèctor,
A fundamental motivation for this proposal is to give each
program the power to use and propagate as much information as
it needs in the pipeline. In our discussion on IRC,
Tino Didriksen said:

You should see how much secondary information VISL's streams
have. Noun semantics, verb frames, dependency, markup tags,
etc. Being able to carry any information along makes many
things possible, often things you can't imagine because of
current limitations.

The example in my original message was, well, just an example, but
the idea is that you can add as much information as you want, in the
language modules or even the translation modules. It's also not
just about English words, but about all languages. This is not to
say that we have to add case information to every English word. It
is optional information which can be added if needed for the
translation task.

With this proposal we're trying to prepare the Apertium stream for
the future. Today we realised that we need the surface form in the
stream, and tomorrow we might need semantic tags, sentiment tags,
etc. *If we don't do this now, we will have to modify all the
parsers in the pipeline each time we need more information in the
pipe.* This is why it's a good idea to modify the parsers so that
they can handle an arbitrary amount of information.

Lastly, one point we should discuss is this idea about how any
secondary information I add in the monodix would be available for
everyone who uses that information. There are several things to say
about this:

* As long as the information is correct, I don't really see why
redundant secondary information should bother anyone. It will
be available for anyone who wishes to use it for their task,
and if you don't want to use it the programs will ignore it.
* Another idea is that secondary information could be put in a
separate dix, however this would lead to an unnecessary
increase in complexity.
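As a rough illustration of "the programs will ignore it", the sketch below strips secondary tags from a stream so that what remains has the current format. It assumes the syntax proposed in the original message, where a secondary tag is any tag containing a ":". The function name is hypothetical, and the sketch does not handle escaped characters or values that themselves contain "<" or ">".

```python
import re

# Assumption: a secondary tag is any <...> tag that contains a ':'.
SECONDARY_TAG = re.compile(r"<[^<>:]*:[^<>]*>")

def strip_secondary(stream: str) -> str:
    """Remove secondary tags, leaving only the primary information."""
    return SECONDARY_TAG.sub("", stream)
```

A program that does not care about secondary information could apply a filter like this on input, or simply skip such tags while parsing.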

Unless the developers of Apertium feel that redundant
information in the stream will be a huge problem, this will allow
each program to access a lot more information and open up
possibilities that we haven't even thought of yet. At the very
least, it will help us to eliminate trimming.

Thanks and Regards,
Tanmai Khanna

On Sun, Mar 29, 2020 at 10:39 AM Hèctor Alòs i Font
<hectora...@gmail.com> wrote:

Hi Tanmai,

I am surprised by this proposal. It involves some very
important changes that should be better justified. I don't
quite understand where one should define the "optional
secondary information" in addition to the current
morphological fields. Will it be in the language module
(apertium-xxx) or in each of the translation modules
(apertium-xxx-yyy)? Part of the problem may be in the example.
I can't imagine why information on case should be added to
every English word (any more than, say, information about
possession, which is common in Turkic languages). Will this
kind of information, unnecessary for everybody or almost
everybody, end up in every language pair using, say,
English, just because someone wants to add it for his or her
specific purposes? As far as I understand, the given project
needs to add the surface form of the word. This seems quite
logical. Moreover, this information may be useful for, e.g.,
lexical selection and structural transfer. But anything more
than that seems too obscure to me.

Best,
Hèctor

Message from Tanmai Khanna <khanna.tan...@gmail.com>
on Sat., 28 March 2020 at 23:51:

Hey guys,
As part of the project to eliminate trimming, I had to
come up with a way to include the surface form in the
lexical unit, and hence to modify the Apertium stream
format. To do this I would have to modify the parsers of
every program in the pipeline, and if that has to happen,
we discussed on the IRC that *it might be a good idea to
modify the stream in such a way that we can include an
arbitrary amount of information in a lexical unit, and
each program can use whatever information they need.*

The current information in the lexical unit would be
primary information, and then we would have optional
secondary information which could contain the surface
form, but also literally anything you can think of (case,
sentiment, pragmatic info, etc.). This would open up a lot
of possibilities for each program, and it would
strengthen the apertium stream format considerably.

We discussed several possible syntaxes for this new stream
format, and the one that seems best is something like
this:

^potato<n><pl><case:aa><sf:potatoes><other-prefix:other-value>/patata<n><f><pl><more:other>$

This doesn't mess with the current stream format too much.
The number of tags is already arbitrary so that helps. The
secondary tags contain a ":" that would help distinguish
them from primary tags.
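A minimal sketch of how a parser might separate the two kinds of tags under this syntax is given below. It is not taken from any Apertium parser; the names `Reading`, `parse_reading` and `parse_unit` are hypothetical, and the sketch ignores escaped characters and assumes that no tag value contains "/", "<" or ">".

```python
import re
from typing import Dict, List, NamedTuple

TAG = re.compile(r"<([^<>]*)>")

class Reading(NamedTuple):
    lemma: str
    primary: List[str]          # tags without ':' (current format)
    secondary: Dict[str, str]   # prefix -> value for tags with ':'

def parse_reading(text: str) -> Reading:
    # Everything before the first '<' is the lemma.
    lemma = text.split("<", 1)[0]
    primary: List[str] = []
    secondary: Dict[str, str] = {}
    for tag in TAG.findall(text):
        if ":" in tag:
            # Secondary tag: split once, so the value may contain ':'.
            key, value = tag.split(":", 1)
            secondary[key] = value
        else:
            primary.append(tag)
    return Reading(lemma, primary, secondary)

def parse_unit(unit: str) -> List[Reading]:
    # Strip the ^...$ delimiters and split the readings on '/'.
    body = unit.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("not a lexical unit: " + unit)
    return [parse_reading(part) for part in body[1:-1].split("/")]
```

Applied to the example above, the first reading would have the lemma "potato", the primary tags n and pl, and the secondary entries case, sf and other-prefix. A program that only needs the surface form would look up "sf" and leave the rest alone.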

Implementing this would still require modifying all the
parsers, but the benefits far outweigh the amount of work
needed to pull this off.

Since this would be a major fundamental change to
Apertium, I ask you all to contribute your views,
any pros, cons, suggestions - on the idea, on the syntax,
anything.

Thanks and Regards,
Tanmai Khanna

--*Khanna, Tanmai*

            _______________________________________________
            Apertium-stuff mailing list
            Apertium-stuff@lists.sourceforge.net
            <mailto:Apertium-stuff@lists.sourceforge.net>
            https://lists.sourceforge.net/lists/listinfo/apertium-stuff



--
Mikel L. Forcada  http://www.dlsi.ua.es/~mlf/
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03690 Sant Vicent del Raspeig
Spain
Office: +34 96 590 9776
