Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

Jonathan Washington Tue, 21 Apr 2020 09:15:26 -0700

The main thing I worry about here is lrx rules.

Currently a lot of pairs have rules that match e.g. tags="adj", but not
necessarily tags="adj.*".  So something that's normally hargle<adj> might
now be hargle<adj><sf:hargle>, and that means the lrx rule won't match.


Since we want this to be backwards-compatible (without rewriting rules),
the lrx compiler and/or processor will have to be rewritten to ignore
secondary tags for matching (unless a rule is written to check a secondary
tag??).
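
To make the concern concrete, here is a minimal sketch in Python. It is not
the actual lrx compiler or processor, only a toy model of the matching
behaviour described above. It assumes that a tags pattern is a dot-separated
list in which "*" stands for any run of tags, and that a secondary tag can be
recognised by a colon inside it, as in <sf:hargle>. The function names and
the ignore_secondary flag are hypothetical.

```python
import re

TAG_RE = re.compile(r"<([^<>]+)>")


def split_lu(lu):
    """Split 'hargle<adj><sf:hargle>' into ('hargle', ['adj', 'sf:hargle'])."""
    lemma = lu.split("<", 1)[0]
    return lemma, TAG_RE.findall(lu)


def is_secondary(tag):
    # Assumption: a secondary tag is any tag containing a colon (sf:hargle).
    return ":" in tag


def tags_match(pattern, tags):
    """Match a dot-separated pattern against a tag list; '*' is any run of tags."""
    regex = "".join(
        "(?:<[^<>]+>)*" if part == "*" else re.escape("<" + part + ">")
        for part in pattern.split(".")
    )
    return re.fullmatch(regex, "".join("<%s>" % t for t in tags)) is not None


def rule_matches(pattern, lu, ignore_secondary=False):
    """Check a rule's tags pattern, optionally dropping secondary tags first."""
    _, tags = split_lu(lu)
    if ignore_secondary:
        tags = [t for t in tags if not is_secondary(t)]
    return tags_match(pattern, tags)
```

In this toy model, rule_matches("adj", "hargle<adj>") succeeds, the same
pattern fails on "hargle<adj><sf:hargle>", and it succeeds again once
ignore_secondary=True is passed, which is the backwards-compatible behaviour
existing rules would need.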

I guess this sort of worry is the sort of thing you're keeping track of so
that it can be worked on?

--
Jonathan

On Mon, Apr 20, 2020, 14:52 Tanmai Khanna <khanna.tan...@gmail.com> wrote:

> In a nutshell, by using the source analysis for disambiguation and
> transfer, we make the translation output better, and by outputting the
> source surface form instead of the source lemma, we make the output more
> comprehensible, or post-editable.
>
> Tanmai
>
> On Tue, Apr 21, 2020 at 12:19 AM Tanmai Khanna <khanna.tan...@gmail.com>
> wrote:
>
>> Hey Francis,
>> I agree that it does seem like a solution searching for a problem if we
>> look at it in isolation. But it's important to look at this in the context
>> of eliminating trimming. Chronologically, this project was first, and
>> still is, about eliminating dictionary trimming. Modifying the stream
>> is just part of the solution - a solution that will help with this
>> problem, but also potentially with several other problems, such as the
>> superblank reordering problem. I went into detail about this in the
>> proposal, but I'll explain it here.
>>
>> The monodix of a language is generally larger than the bidix for a
>> language pair involving that language. It was noticed that if the monodix
>> is used as is, there are a lot of translation errors (the ones with @),
>> which basically just put the lemma of the source language if a
>> translation isn't available. To deal with this, dictionary trimming was
>> added: a word is removed from the monodix if it isn't present in the
>> bidix, so it goes through the pipeline as an unknown word and the source
>> surface form appears in the final translation (with a *), which is
>> arguably better and more intelligible than just the source lemma.
>>
>> However, trimming meant giving up certain benefits. Let's look at these
>> benefits in greater detail:
>>
>>    - *Lexical Selection:* By discarding the analysis of a word in the
>>    source language, we lose the ability to use it as context to disambiguate
>>    the words around it. Assume a [Noun Adjective] in which we don't know
>>    the translation of the Adjective, i.e. it isn't in the bidix. With
>>    trimming we would discard it, and hence if the Noun has several
>>    ambiguous forms, we have no way to disambiguate it, since we've
>>    discarded the analysis of the Adjective (which included the fact that
>>    it's an adjective).
>>    - *Transfer:* In the same example, assume that in the target
>>    language, [Noun Adj] is to be rearranged into [Adj Noun]. With trimming,
>>    this can't be done as we've discarded the analysis of the Adjective,
>>    treating it as an unknown word.
>>
>> Now, if we don't discard the analysis and don't trim, we would again fall
>> into the earlier problem of untranslated lemmas.
>>
>> This project is a way to have our cake and eat it too. We don't discard
>> the analysis even if we don't know the translation, but we don't just
>> output the lemma either - we output the source surface form. For a solution
>> like this, it is *essential that we propagate the surface form until at
>> least transfer, or even until the generator*, so that we can use the
>> benefits of the source analysis and then, before translation, discard it
>> and use the source surface form.
>>
>> Currently the source surface form is discarded at the tagger. This is
>> where the stream modification comes in. It's a robust way to propagate the
>> surface form through the stream with the least disruption to the current
>> modules.
>>
>> Then there are other possible benefits of secondary information, such as
>> markup tags. Hope this makes sense.
>>
>> Tanmai
>>
>> On Tue, Apr 21, 2020 at 12:02 AM Francis Tyers <fty...@prompsit.com>
>> wrote:
>>
>>> El 2020-04-20 19:21, Daniel Swanson escribió:
>>> >> Another way of putting this is that it looks like a technical
>>> > solution
>>> >> in search of a problem, rather than a problem description in search
>>> >> of a solution.
>>> >
>>> > To me the most obvious thing to do with it is to put markup
>>> > information in secondary tags as a way of solving the superblank
>>> > reordering problem.
>>> >
>>>
>>> Didn't we have a solution for this that was worked on over a couple
>>> of GSoC projects?
>>>
>>> Fran
>>>
>>>
>>> _______________________________________________
>>> Apertium-stuff mailing list
>>> Apertium-stuff@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>>>
>>
>>
>> --
>> *Khanna, Tanmai*
>>
>
>
> --
> *Khanna, Tanmai*
>

