Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

Tanmai Khanna Sun, 29 Mar 2020 02:43:00 -0700

Hey guys,
Here's a draft proposal for this project:
http://wiki.apertium.org/wiki/User:Khannatanmai/GSoC2020Proposal_Trimming
It's almost done except for the work plan. Any comments will be
appreciated :)


Thanks,
Tanmai

On Sun, Mar 29, 2020 at 12:52 PM Tanmai Khanna <[email protected]>
wrote:

> Hi Hèctor,
> A fundamental motivation for this proposal is the possibility of giving
> the power to each program to use and propagate as much information as it
> needs in the pipeline. In our discussion on the IRC, Tino Didriksen said:
>
>> You should see how much secondary information VISL's streams have. Noun
>> semantics, verb frames, dependency, markup tags, etc. Being able to carry
>> any information along makes many things possible, often things you can't
>> imagine because of current limitations.
>>
>
> The example in my original message was just that, an example. The idea is
> that you can add as much information as you want, in the language modules
> or even the translation modules. It's also not just about English words,
> but about all languages. This is not to say that we have to add case
> information to every English word. It is optional information which can be
> added if needed for the translation task.
>
> With this proposal we're trying to prepare the apertium stream for the
> future. Today we realised that we need the surface form in the stream, and
> tomorrow we might need semantic tags, sentiment tags, etc. *If we don't
> do this now, we will have to modify all the parsers in the pipeline each
> time we need more information in the pipe.* This is why it's a good idea
> to modify the parsers so that they can handle an arbitrary amount of
> information.
>
> Lastly, one point we should discuss is the idea that any secondary
> information I add in the monodix would be available to everyone who uses
> that monodix. There are several things to say about this:
>
>    - As long as the information is correct, I don't really see why
>    redundant secondary information should bother anyone. It will be available
>    for anyone who wishes to use it for their task, and if you don't want to
>    use it the programs will ignore it.
>    - Another idea is that secondary information could be put in a
>    separate dix; however, this would lead to an unnecessary increase in
>    complexity.
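
To illustrate the first point, here is a minimal sketch of how a program
that doesn't need the secondary information could simply drop it. It
assumes the <prefix:value> syntax from my original message (quoted below),
i.e. that a secondary tag is any tag containing a ":". The function name
strip_secondary is made up for this example and is not part of any
existing Apertium tool. Escaped characters are not handled.

```python
import re

# Assumption: a secondary tag is any <...> tag whose body contains a colon,
# e.g. <sf:potatoes>. Primary tags such as <n> or <pl> never contain one.
SECONDARY_TAG_RE = re.compile(r"<[^<>:]*:[^<>]*>")


def strip_secondary(stream):
    """Return the stream with every secondary tag removed.

    Hypothetical helper: primary tags, lemmas and the ^ / $ delimiters
    are left untouched. Backslash-escaped characters are not handled.
    """
    return SECONDARY_TAG_RE.sub("", stream)


if __name__ == "__main__":
    lu = "^potato<n><pl><case:aa><sf:potatoes>/patata<n><f><pl><more:other>$"
    print(strip_secondary(lu))
```

A program that does this sees exactly the stream it sees today, so the
extra information costs it nothing.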
>
> Unless the developers of Apertium feel that redundant information in
> the stream will be a huge problem, this will allow each program to access a
> lot more information and open up possibilities that we haven't even thought
> of yet. At the very least, it will help us to eliminate trimming.
>
> Thanks and Regards,
> Tanmai Khanna
>
> On Sun, Mar 29, 2020 at 10:39 AM Hèctor Alòs i Font <[email protected]>
> wrote:
>
>> Hi Tanmai,
>>
>> I am surprised by this proposal. It involves some very important changes
>> that should be better justified. I don't quite understand when one should
>> define the "optional secondary information" in addition to the current
>> morphological fields. Will it be in the language module (apertium-xxx) or
>> in each of the translation modules (apertium-xxx-yyy)? Part of the problem
>> may be in the example. I can't imagine why information on case should be
>> added to every English word (any more than, say, information about
>> possession, which is common in Turkic languages). Will this kind of
>> information, unnecessary for everybody or almost everybody, be found
>> in every language pair using, say, English, if someone wants to add it
>> for his or her specific purposes? As far as I understand, the given
>> project needs the surface form of the word to be added. This seems
>> quite logical. Moreover, this information may be useful for e.g. lexical
>> selection and structural transfer. But anything more than that seems too
>> obscure to me.
>>
>> Best,
>> Hèctor
>>
>> Message from Tanmai Khanna <[email protected]> on Sat, 28 Mar
>> 2020 at 23:51:
>>
>>> Hey guys,
>>> As part of the project to eliminate trimming, I had to come up with a
>>> way to include the surface form in the lexical unit, and hence to modify
>>> the apertium stream format. To do this I would have to modify the parsers
>>> of every program in the pipeline, and if that has to happen, we discussed
>>> on the IRC that *it might be a good idea to modify the stream in such a
>>> way that we can include an arbitrary amount of information in a lexical
>>> unit, and each program can use whatever information it needs.*
>>>
>>> The current information in the lexical unit would be primary
>>> information, and then we would have optional secondary information which
>>> could contain the surface form, but also literally anything you can think
>>> of (case, sentiment, pragmatic info, etc.). This would open up a lot of
>>> possibilities for each program, and it would strengthen the apertium stream
>>> format considerably.
>>>
>>> We discussed several possible syntaxes for this new stream format, and
>>> the one that seems best is something like this:
>>>
>>> ^potato<n><pl><case:aa><sf:potatoes><other-prefix:other-value>/patata<n><f><pl><more:other>$
>>>
>>> This doesn't mess with the current stream format too much. The number of
>>> tags is already arbitrary so that helps. The secondary tags contain a ":"
>>> that would help distinguish them from primary tags.
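
To make the primary/secondary split concrete, here is a minimal sketch of
a parser for the example above. It is only an illustration under stated
assumptions: the names parse_reading and parse_lexical_unit are made up
for this example and are not taken from any Apertium program, a tag is
treated as secondary if and only if it contains a ":", and escaped
characters (or a "/" inside a value) are not handled.

```python
import re

# Matches one <...> tag and captures its body.
TAG_RE = re.compile(r"<([^<>]+)>")


def parse_reading(reading):
    """Split one reading, e.g. 'potato<n><pl><sf:potatoes>', into
    (lemma, primary_tags, secondary_info).

    Assumption: a tag containing ':' is secondary and is split at the
    first ':' into a prefix and a value; every other tag is primary.
    """
    first = reading.find("<")
    lemma = reading if first == -1 else reading[:first]
    primary, secondary = [], {}
    for tag in TAG_RE.findall(reading):
        if ":" in tag:
            prefix, value = tag.split(":", 1)
            secondary[prefix] = value
        else:
            primary.append(tag)
    return lemma, primary, secondary


def parse_lexical_unit(lu):
    """Parse '^reading/reading/...$' into a list of parsed readings."""
    if not (lu.startswith("^") and lu.endswith("$")):
        raise ValueError("not a lexical unit: %r" % lu)
    return [parse_reading(r) for r in lu[1:-1].split("/")]


if __name__ == "__main__":
    example = ("^potato<n><pl><case:aa><sf:potatoes>"
               "<other-prefix:other-value>/patata<n><f><pl><more:other>$")
    for lemma, primary, secondary in parse_lexical_unit(example):
        print(lemma, primary, secondary)
```

A program that only cares about the surface form would look up
secondary.get("sf") and ignore every other key.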
>>>
>>> Implementing this would still require modifying all the parsers, but
>>> the benefits far outweigh the amount of work needed to pull this off.
>>>
>>> Since this would be a major fundamental change to Apertium, I ask you
>>> all to contribute your views (pros, cons, suggestions) on the idea, the
>>> syntax, anything.
>>>
>>> Thanks and Regards,
>>> Tanmai Khanna
>>>
>>> --
>>> *Khanna, Tanmai*
>>> _______________________________________________
>>> Apertium-stuff mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>>>
>>
>
>
> --
> *Khanna, Tanmai*
>


-- 
*Khanna, Tanmai*

_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
