Hey guys,

Here's a draft proposal for this project: http://wiki.apertium.org/wiki/User:Khannatanmai/GSoC2020Proposal_Trimming

It's almost done except the work plan. Any comments will be appreciated :)

Thanks,
Tanmai

On Sun, Mar 29, 2020 at 12:52 PM Tanmai Khanna <khanna.tan...@gmail.com> wrote:

> Hi Hèctor,
>
> A fundamental motivation for this proposal is to give each program the power to use and propagate as much information as it needs in the pipeline. In our discussion on the IRC, Tino Didriksen said:
>
>> You should see how much secondary information VISL's streams have. Noun semantics, verb frames, dependency, markup tags, etc. Being able to carry any information along makes many things possible, often things you can't imagine because of current limitations.
>
> The example in my original message was, well, just an example, but the idea is that you can add as much information as you want, in the language modules or even the translation modules. It's also not just about English words, but about all languages. This is not to say that we have to add case information to every English word. It is optional information which can be added if needed for the translation task.
>
> With this proposal we're trying to prepare the apertium stream for the future. Today we realised that we need the surface form in the stream, and tomorrow we might need semantic tags, sentiment tags, etc. *If we don't do this now, we will have to modify all the parsers in the pipeline each time we need more information in the pipe.* This is why it's a good idea to modify the parsers so that they can handle an arbitrary amount of information.
>
> Lastly, one point we should discuss is the idea that any secondary information I add in the monodix would be available to everyone who uses that monodix. There are several things to say about this:
>
>    - As long as the information is correct, I don't really see why redundant secondary information should bother anyone. It will be available for anyone who wishes to use it for their task, and if you don't want to use it the programs will ignore it.
>    - Another idea is that secondary information could be put in a separate dix; however, this would lead to an unnecessary increase in complexity.
>
> Unless the developers of Apertium feel that redundant information in the stream will be a huge problem, this will allow each program to access a lot more information and open up possibilities that we haven't even thought of yet. At the very least, it will help us to eliminate trimming.
>
> Thanks and Regards,
> Tanmai Khanna
>
> On Sun, Mar 29, 2020 at 10:39 AM Hèctor Alòs i Font <hectora...@gmail.com> wrote:
>
>> Hi Tanmai,
>>
>> I am surprised by this proposal. It involves some very important changes that should be better justified. I don't quite understand when one should define the "optional secondary information" in addition to the current morphological fields. Will it be in the language module (apertium-xxx) or in each of the translation modules (apertium-xxx-yyy)? Part of the problem may be in the example. I can't imagine why information on case should be added to every English word (any more than, say, information about belonging, which is common in Turkic languages). Will this kind of information, unnecessary for everybody or almost everybody, be found in every language pair using, say, English, if someone wants to add it for his or her specific purposes? As far as I understand, the given project needs the surface form of the word to be added. This seems quite logical. Moreover, this information may be useful for e.g. lexical selection and structural transfer. But anything more than that seems too obscure to me.
>>
>> Best,
>> Hèctor
>>
>> On Sat, 28 March 2020 at 23:51, Tanmai Khanna <khanna.tan...@gmail.com> wrote:
>>
>>> Hey guys,
>>>
>>> As part of the project to eliminate trimming, I had to come up with a way to include the surface form in the lexical unit, and hence to modify the apertium stream format.
>>> To do this I would have to modify the parsers of every program in the pipeline, and if that has to happen, we discussed on the IRC that *it might be a good idea to modify the stream in such a way that we can include an arbitrary amount of information in a lexical unit, and each program can use whatever information it needs.*
>>>
>>> The current information in the lexical unit would be primary information, and then we would have optional secondary information which could contain the surface form, but also literally anything you can think of (case, sentiment, pragmatic info, etc.). This would open up a lot of possibilities for each program, and it would strengthen the apertium stream format considerably.
>>>
>>> We discussed several possible syntaxes for this new stream format, and the one that seems the best is something like this:
>>>
>>> ^potato<n><pl><case:aa><sf:potatoes><other-prefix:other-value>/patata<n><f><pl><more:other>$
>>>
>>> This doesn't mess with the current stream format too much. The number of tags is already arbitrary, so that helps. The secondary tags contain a ":" that would help distinguish them from primary tags.
>>>
>>> To implement this, all the parsers would still need to be modified, but the benefits far outweigh the amount of work needed to pull this off.
>>>
>>> Since this would be a major fundamental change to Apertium, I request you all to contribute your views, any pros, cons, suggestions - to the idea, to the syntax, anything.
>>>
>>> Thanks and Regards,
>>> Tanmai Khanna
>>>
>>> --
>>> *Khanna, Tanmai*
>>> _______________________________________________
>>> Apertium-stuff mailing list
>>> Apertium-stuff@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>
> --
> *Khanna, Tanmai*

--
*Khanna, Tanmai*
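
To make the proposed syntax from the original message concrete, here is a minimal Python sketch of how a parser might split a lexical unit into primary tags and secondary tags (those containing a ":"), and how a program that doesn't need the secondary information could simply drop it. This is only an illustration under assumptions: the names `Reading`, `parse_reading`, `parse_lexical_unit` and `strip_secondary` are hypothetical and not part of any Apertium tool, escaping of special characters is ignored, and the first ":" in a tag is assumed to separate the prefix from the value.

```python
import re
from dataclasses import dataclass, field

# Matches one <...> tag and captures its contents (no escaping handled).
TAG_RE = re.compile(r"<([^<>]+)>")


@dataclass
class Reading:
    """One '/'-separated part of a lexical unit (hypothetical structure)."""
    lemma: str
    primary: list = field(default_factory=list)    # e.g. ["n", "pl"]
    secondary: dict = field(default_factory=dict)  # e.g. {"sf": "potatoes"}


def parse_reading(text):
    """Split 'lemma<tag>...' into a lemma, primary tags and secondary tags."""
    lemma = text.split("<", 1)[0]
    reading = Reading(lemma)
    for tag in TAG_RE.findall(text):
        if ":" in tag:
            # Assumption: the first ':' separates prefix from value.
            prefix, value = tag.split(":", 1)
            reading.secondary[prefix] = value
        else:
            reading.primary.append(tag)
    return reading


def parse_lexical_unit(unit):
    """Parse '^...$' into a list of Reading objects, one per '/' part."""
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("lexical unit must start with '^' and end with '$'")
    return [parse_reading(part) for part in unit[1:-1].split("/")]


def strip_secondary(unit):
    """Remove every tag containing ':' so only primary tags remain."""
    return re.sub(r"<[^<>]*:[^<>]*>", "", unit)


if __name__ == "__main__":
    example = ("^potato<n><pl><case:aa><sf:potatoes><other-prefix:other-value>"
               "/patata<n><f><pl><more:other>$")
    for reading in parse_lexical_unit(example):
        print(reading.lemma, reading.primary, reading.secondary)
    print(strip_secondary(example))  # ^potato<n><pl>/patata<n><f><pl>$
```

A program that only cares about the surface form would read `secondary.get("sf")` and leave everything else untouched, while a program that needs none of it sees the same stream it sees today after `strip_secondary`. A real implementation would also have to handle escaped characters, which this sketch does not.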