Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

Tanmai Khanna Sun, 29 Mar 2020 00:23:00 -0700

Hi Hèctor,
A fundamental motivation for this proposal is to let each program in the
pipeline use and propagate as much information as it needs. In our
discussion on IRC, Tino Didriksen said:


> You should see how much secondary information VISL's streams have. Noun
> semantics, verb frames, dependency, markup tags, etc. Being able to carry
> any information along makes many things possible, often things you can't
> imagine because of current limitations.
>

The example in my original email was, well, just an example, but the idea is
that you can add as much information as you want, in the language modules
or even the translation modules. It's also not just about English words,
but about all languages. This is not to say that we have to add case
information to every English word. It is optional information which can be
added if needed for the translation task.

With this proposal we're trying to prepare the apertium stream for the
future. Today we realised that we need the surface form in the stream, and
tomorrow we might need semantic tags, sentiment tags, etc. *If we don't do
this now, we will have to modify all the parsers in the pipeline each time
we need more information in the pipe.* This is why it's a good idea to
modify the parsers now so that they can handle an arbitrary amount of
information.
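
To make this a bit more concrete, here is a minimal Python sketch of how a
parser could separate primary tags from secondary ones, assuming the
"prefix:value" syntax from my original mail (quoted below). The function
names are hypothetical and not part of any existing Apertium tool, and the
sketch deliberately ignores backslash-escaped characters, which a real
parser would have to handle.

```python
import re

# Hypothetical helper, not an existing Apertium API.
# Matches the content of every <...> tag in a reading.
TAG = re.compile(r"<([^<>]+)>")


def parse_reading(reading):
    """Split one reading, e.g. 'potato<n><pl><sf:potatoes>', into
    (lemma, primary_tags, secondary), where secondary maps prefix -> value."""
    lemma = reading.split("<", 1)[0]
    primary, secondary = [], {}
    for tag in TAG.findall(reading):
        prefix, sep, value = tag.partition(":")
        if sep:
            # Assumption: a ':' inside a tag marks it as secondary.
            secondary[prefix] = value
        else:
            primary.append(tag)
    return lemma, primary, secondary


def parse_lexical_unit(lu):
    """Parse '^reading/reading/...$' into a list of parsed readings.
    Assumption: no escaped '/' characters occur inside the unit."""
    body = lu.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("not a lexical unit: %r" % lu)
    return [parse_reading(r) for r in body[1:-1].split("/")]
```

A program that only cares about the surface form would look up
secondary.get("sf") and leave every other key alone.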

Lastly, one point we should discuss is the concern that any secondary
information I add in the monodix would be visible to everyone who uses
that monodix. There are a couple of things to say about this:

   - As long as the information is correct, I don't really see why
   redundant secondary information should bother anyone. It will be available
   for anyone who wishes to use it for their task, and if you don't want to
   use it, the programs will ignore it.
   - Another idea is that secondary information could be put in a separate
   dix; however, this would lead to an unnecessary increase in complexity.
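
As a rough illustration of the first point, a program that has no use for
secondary information could simply look at the stream without it. The
sketch below is only an assumption about how that might be done (the helper
name is made up, and escaped characters are not handled); it relies on the
proposed convention that secondary tags contain a ":".

```python
import re

# Hypothetical helper: matches any tag whose content contains a colon,
# e.g. <sf:potatoes>, but not plain tags such as <n> or <pl>.
SECONDARY_TAG = re.compile(r"<[^<>:]*:[^<>]*>")


def strip_secondary(stream):
    """Return the stream with all secondary tags removed, leaving the
    primary tags exactly as they are in the current format."""
    return SECONDARY_TAG.sub("", stream)
```

In practice a module would more likely skip these tags while reading and
pass them through unchanged on output, so that later programs still see
them.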

Unless the developers of Apertium feel that redundant information in the
stream will be a huge problem, this will allow each program to access a lot
more information and open up possibilities that we haven't even thought of
yet. At the very least, it will help us to eliminate trimming.

Thanks and Regards,
Tanmai Khanna

On Sun, Mar 29, 2020 at 10:39 AM Hèctor Alòs i Font <[email protected]>
wrote:

> Hi Tanmai,
>
> I am surprised by this proposal. It involves some very important changes
> that should be better justified. I don't quite understand when one should
> define the "optional secondary information" in addition to the current
> morphological fields. Will it be in the language module (apertium-xxx) or
> in each of the translation modules (apertium-xxx-yyy)? Part of the problem
> may be in the example. I can't imagine why information on case should be
> added to every English word (any more than, say, information about
> belonging, which is common in Turkic languages). Will this kind of
> information, unnecessary for everybody or almost everybody, be found in
> every language pair using, say, English, if someone wants to add it for
> his or her specific purposes? As far as I understand, the given project
> needs to add the surface form of the word. This seems quite logical.
> Moreover, this information may be useful for e.g. lexical selection and
> structural transfer. But anything more than that seems too obscure to me.
>
> Best,
> Hèctor
>
> Message from Tanmai Khanna <[email protected]> on Sat., 28 March
> 2020 at 23:51:
>
>> Hey guys,
>> As part of the project to eliminate trimming, I had to come up with a way
>> to include the surface form in the lexical unit, and hence to modify the
>> apertium stream format. To do this I would have to modify the parsers of
>> every program in the pipeline, and if that has to happen, we discussed on
>> IRC that *it might be a good idea to modify the stream in such a way
>> that we can include an arbitrary amount of information in a lexical unit,
>> and each program can use whatever information it needs.*
>>
>> The current information in the lexical unit would be primary information,
>> and then we would have optional secondary information which could contain
>> the surface form, but also literally anything you can think of (case,
>> sentiment, pragmatic info, etc.). This would open up a lot of possibilities
>> for each program, and it would strengthen the apertium stream format
>> considerably.
>>
>> We discussed several possible syntaxes for this new stream format, and the
>> one that seems best is something like this:
>>
>> ^potato<n><pl><case:aa><sf:potatoes><other-prefix:other-value>/patata<n><f><pl><more:other>$
>>
>> This doesn't mess with the current stream format too much. The number of
>> tags is already arbitrary so that helps. The secondary tags contain a ":"
>> that would help distinguish them from primary tags.
>>
>> Implementing this would still require modifying all the parsers, but the
>> benefits far outweigh the amount of work needed to pull this off.
>>
>> Since this would be a major fundamental change to Apertium, I request you
>> all to contribute your views: any pros, cons, or suggestions on the idea,
>> the syntax, anything.
>>
>> Thanks and Regards,
>> Tanmai Khanna
>>
>> --
>> *Khanna, Tanmai*
>> _______________________________________________
>> Apertium-stuff mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>>
>


-- 
*Khanna, Tanmai*

