The main thing I worry about here is lrx rules. Currently a lot of pairs have rules that match e.g. tags="adj", but not necessarily tags="adj.*". So something that's normally hargle<adj> might now be hargle<adj><sf:hargle>, and that means the lrx rule won't match.
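To make the mismatch concrete, here is a rough Python sketch. It is not actual lrx code: `parse_lu`, `is_secondary`, `tags_match` and `tags_match_ignoring_secondary` are hypothetical names, the matcher is a simplified stand-in for how lrx compares dot-separated tag patterns, and it assumes a secondary tag can be recognised by its `prefix:value` shape, as in `<sf:hargle>`.

```python
import re

TAG_RE = re.compile(r"<([^<>]+)>")


def parse_lu(lu):
    """Split 'hargle<adj><sf:hargle>' into ('hargle', ['adj', 'sf:hargle'])."""
    lemma = lu.split("<", 1)[0]
    return lemma, TAG_RE.findall(lu)


def is_secondary(tag):
    # Assumption: secondary tags look like 'prefix:value', e.g. 'sf:hargle'.
    return ":" in tag


def tags_match(pattern, tags):
    """Simplified tag matching: the pattern is dot-separated, and a
    trailing '*' accepts any remaining tags."""
    wanted = pattern.split(".")
    if wanted[-1] == "*":
        prefix = wanted[:-1]
        return tags[:len(prefix)] == prefix
    return tags == wanted


def tags_match_ignoring_secondary(pattern, tags):
    """Same matching, but secondary tags are dropped first."""
    primary = [t for t in tags if not is_secondary(t)]
    return tags_match(pattern, primary)


_, old_tags = parse_lu("hargle<adj>")
_, new_tags = parse_lu("hargle<adj><sf:hargle>")

print(tags_match("adj", old_tags))                     # existing rule, old stream
print(tags_match("adj", new_tags))                     # existing rule, new stream
print(tags_match("adj.*", new_tags))                   # wildcard rule, new stream
print(tags_match_ignoring_secondary("adj", new_tags))  # secondary tags skipped
```

With this toy matcher, the exact pattern `adj` accepts the old form and rejects the new one, while dropping secondary tags before matching accepts both without touching the rule.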
Since we want this to be backwards-compatible (without rewriting rules), the lrx compiler and/or processor will have to be changed to ignore secondary tags when matching (unless a rule is explicitly written to check a secondary tag?). I assume this is the sort of concern you're keeping track of so that it can be worked on?

--
Jonathan

On Mon, Apr 20, 2020, 14:52 Tanmai Khanna <khanna.tan...@gmail.com> wrote:

> In a nutshell, by using the source analysis for disambiguation and
> transfer, we make the translation output better, and by outputting the
> source surface form instead of the source lemma, we make the output more
> comprehensible, or post-editable.
>
> Tanmai
>
> On Tue, Apr 21, 2020 at 12:19 AM Tanmai Khanna <khanna.tan...@gmail.com>
> wrote:
>
>> Hey Francis,
>> I agree that it does seem like a solution searching for a problem if we
>> look at it in isolation. But it's important to look at this in the context
>> of eliminating trimming. Chronologically, this project was first, and
>> still is, about eliminating dictionary trimming. Modification to the stream
>> is just part of the solution - a solution that will help this problem, but
>> also potentially several other problems, such as the superblank reordering
>> problem. I went into detail about this in the proposal but I'll explain it
>> here.
>>
>> The monodix of a language is generally larger than the bidix for a
>> language pair involving that language. It was noticed that if used as
>> is, there are a lot of translation errors (the ones with @), which
>> basically just put the lemma of the source language if a translation
>> isn't available. To deal with this, dictionary trimming was added, which
>> basically removed a word from the monodix if it wasn't present in the
>> bidix, so that it went through the pipeline as an unknown word and the
>> source surface form appeared in the final translation (with a *), which
>> is arguably better and more intelligible than just the source lemma.
>>
>> However, trimming meant giving up certain benefits. Let's look at these
>> benefits in greater detail:
>>
>>    - *Lexical Selection:* By discarding the analysis of a word in the
>>    source language, we lose the ability to use it as context to
>>    disambiguate the words around it. Assume a [Noun Adjective] in which
>>    we don't know the translation of the Adjective, i.e. it isn't in the
>>    bidix. With trimming we would discard it, and hence if the Noun has
>>    several ambiguous forms, we have no way to disambiguate it since
>>    we've discarded the analysis of the Adjective (which included the
>>    fact that it's an adjective).
>>    - *Transfer:* In the same example, assume that in the target
>>    language, [Noun Adj] is to be rearranged into [Adj Noun]. With
>>    trimming, this can't be done as we've discarded the analysis of the
>>    Adjective, treating it as an unknown word.
>>
>> Now, if we don't discard the analysis and don't trim, we would again fall
>> into the earlier problem of untranslated lemmas.
>>
>> This project is a way to have our cake and eat it too. We don't discard
>> the analysis even if we don't know the translation, but we don't just
>> output the lemma either - we output the source surface form. For a solution
>> like this, it is *essential that we propagate the surface form till at
>> least transfer or even till the generator*, so that we can use the
>> benefits of the source analysis and then, before translation, discard it
>> and use the source surface form.
>>
>> Currently the source surface form is discarded at the tagger. This is
>> where the stream modification comes in. It's a robust way to propagate the
>> surface form through the stream with the least disruption to the current
>> modules.
>>
>> Then there are other possible benefits of secondary information, such as
>> markup tags. Hope this makes sense.
>>
>> Tanmai
>>
>> On Tue, Apr 21, 2020 at 12:02 AM Francis Tyers <fty...@prompsit.com>
>> wrote:
>>
>>> On 2020-04-20 19:21, Daniel Swanson wrote:
>>> >> Another way of putting this is that it looks like a technical
>>> >> solution in search of a problem, rather than a problem description
>>> >> in search of a solution.
>>> >
>>> > To me the most obvious thing to do with it is to put markup
>>> > information in secondary tags as a way of solving the superblank
>>> > reordering problem.
>>> >
>>>
>>> Didn't we have a solution for this that was worked on over a couple
>>> of GSoC projects?
>>>
>>> Fran
>>>
>>>
>>> _______________________________________________
>>> Apertium-stuff mailing list
>>> Apertium-stuff@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>>>
>>
>>
>> --
>> *Khanna, Tanmai*
>>
>
>
> --
> *Khanna, Tanmai*