Yup, if you look at the transfer output as well, prepositions fail because transfer matches "pr" and not "pr.*". Hence, all FSTs will ignore secondary tags, and there will be a separate matching mechanism for secondary tags.
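To make the split concrete, here's a minimal sketch of how a lexical unit could be divided into primary tags (which the FSTs see) and secondary tags (kept aside in a feature-value map). The `<feat:value>` colon convention and the `sf` tag name are assumptions taken from the examples later in this thread, not a fixed format:

```python
import re

def split_tags(lu):
    """Split an analysis like 'hargle<adj><sf:hargles>' into a lemma,
    primary tags (fed to the FSTs), and secondary tags (kept aside in
    a feature-value map and never seen by the FSTs)."""
    lemma, tags = re.match(r'([^<]*)((?:<[^>]+>)*)', lu).groups()
    primary, secondary = [], {}
    for tag in re.findall(r'<([^>]+)>', tags):
        if ':' in tag:                      # secondary tags are feature:value pairs
            feat, val = tag.split(':', 1)
            secondary[feat] = val
        else:                               # primary tags stay order-dependent
            primary.append(tag)
    return lemma, primary, secondary

lemma, primary, secondary = split_tags('hargle<adj><sf:hargles>')
# primary goes to the FST; secondary sits in the map for separate lookup
```

Because the map is unordered, this is exactly why order-independence falls out for free, while the FST side keeps its strict tag order.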
The problem with treating secondary tags like primary tags is that secondary tags will not be order-dependent. We will not enforce that the surface form should be the first secondary tag, markup the second, and so on. Because of this, using FSTs to match them would be far more complicated. Overall, for backwards compatibility and for better matching of secondary tags, it makes a lot more sense for the FSTs to ignore secondary tags (in lt-proc, transfer, lex-sel, etc.), and we can use a map for secondary tags, taking advantage of their feature-value structure.

Tanmai

On Tue, Apr 21, 2020 at 10:16 PM Daniel Swanson <awesomeevildu...@gmail.com> wrote:

> I think what's written in the proposal is to have pattern-matching FSTs
> skip secondary tags (in this case a small modification to lrx-proc).
>
> It was suggested that matching secondary tags would end up as some sort of
> hash table lookup separate from the FSTs, but I think it could also work to
> just have some way of specifying that you want secondary tags treated like
> normal tags for the purposes of matching, in which case presumably you
> would have written all your rules to use .*.
>
> On Tue, Apr 21, 2020 at 12:14 PM Jonathan Washington <jonathan.n.washing...@gmail.com> wrote:
>
>> The main thing I worry about here is lrx rules.
>>
>> Currently a lot of pairs have rules that match e.g. tags="adj", but not
>> necessarily tags="adj.*". So something that's normally hargle<adj> might
>> now be hargle<adj><sf:hargle>, and that means the lrx rule won't match.
>>
>> Since we want this to be backwards-compatible (without rewriting rules),
>> the lrx compiler and/or processor will have to be rewritten to ignore
>> secondary tags for matching (unless a rule is written to check a secondary
>> tag?).
>>
>> I guess this sort of worry is the sort of thing you're keeping track of
>> so that it can be worked on?
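A backwards-compatible fix along the lines Jonathan describes could look roughly like this sketch. The tag-matching logic here is a heavily simplified stand-in for what lrx-proc actually does, and the `<feat:value>` secondary-tag syntax is an assumption from the examples above:

```python
import re

def strip_secondary(analysis):
    """Drop secondary tags (assumed to look like <feat:value>) so that
    existing lrx rules written against primary tags keep matching."""
    return re.sub(r'<[^<>]+:[^<>]+>', '', analysis)

def lrx_matches(rule_tags, analysis):
    """Very rough approximation of lrx tag matching: rule_tags is a
    dot-separated pattern like 'adj' or 'adj.*'; the primary tags of
    the analysis must match it exactly, with '*' matching any rest."""
    tags = re.findall(r'<([^>]+)>', strip_secondary(analysis))
    pattern = rule_tags.split('.')
    for i, p in enumerate(pattern):
        if p == '*':
            return True
        if i >= len(tags) or tags[i] != p:
            return False
    return len(tags) == len(pattern)

# An old rule written as tags="adj" still matches once the secondary
# tag <sf:hargle> is stripped before matching.
```

With stripping in place, `tags="adj"` matches `hargle<adj><sf:hargle>` just as it matched `hargle<adj>` before, so no pair needs its rules rewritten.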
>>
>> --
>> Jonathan
>>
>> On Mon, Apr 20, 2020, 14:52 Tanmai Khanna <khanna.tan...@gmail.com> wrote:
>>
>>> In a nutshell: by using the source analysis for disambiguation and
>>> transfer, we make the translation output better, and by outputting the
>>> source surface form instead of the source lemma, we make the output more
>>> comprehensible, or post-editable.
>>>
>>> Tanmai
>>>
>>> On Tue, Apr 21, 2020 at 12:19 AM Tanmai Khanna <khanna.tan...@gmail.com> wrote:
>>>
>>>> Hey Francis,
>>>> I agree that it does seem like a solution searching for a problem if we
>>>> look at it in isolation. But it's important to look at this in the context
>>>> of eliminating trimming. Chronologically, this project was first about, and
>>>> still is about, eliminating dictionary trimming. Modification of the stream
>>>> is just part of the solution - a solution that will help this problem, but
>>>> also potentially several other problems, such as the superblank reordering
>>>> problem. I went into detail about this in the proposal, but I'll explain it
>>>> here.
>>>>
>>>> The monodix of a language is generally larger than the bidix of a
>>>> language pair involving that language. It was noticed that if the monodix
>>>> is used as is, there are a lot of translation errors (the ones marked
>>>> with @), which basically just output the lemma of the source word when a
>>>> translation isn't available. To deal with this, dictionary trimming was
>>>> added: it removes a word from the monodix if it isn't present in the
>>>> bidix, so the word goes through the pipeline as an unknown word and the
>>>> source surface form appears in the final translation (with a *), which is
>>>> arguably better and more intelligible than just the source lemma.
>>>>
>>>> However, trimming meant giving up certain benefits.
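Trimming as described above boils down to filtering the monodix by the source side of the bidix. This is only a toy sketch over plain dictionaries; the real lt-trim tool works by intersecting the compiled transducers:

```python
def trim(monodix, bidix_source_side):
    """Naive sketch of dictionary trimming: keep only monodix entries
    whose lemma also appears on the source side of the bidix.  Anything
    removed will later surface as an unknown word (marked with *)."""
    return {lemma: analysis for lemma, analysis in monodix.items()
            if lemma in bidix_source_side}

monodix = {'hargle': 'hargle<adj>', 'blargle': 'blargle<n>'}
bidix_source = {'blargle'}            # 'hargle' has no translation
trimmed = trim(monodix, bidix_source)
# 'hargle' is gone: its analysis (and everything it could tell
# disambiguation and transfer) is discarded along with it
```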
Let's look at these
>>>> benefits in greater detail:
>>>>
>>>> - *Lexical Selection:* By discarding the analysis of a word in the
>>>> source language, we lose the ability to use it as context to disambiguate
>>>> the words around it. Assume a [Noun Adjective] pair in which we don't know
>>>> the translation of the Adjective, i.e. it isn't in the bidix. With trimming
>>>> we would discard it, and hence if the Noun has several ambiguous forms, we
>>>> have no way to disambiguate it, since we've discarded the analysis of the
>>>> Adjective (which included the fact that it's an adjective).
>>>> - *Transfer:* In the same example, assume that in the target
>>>> language, [Noun Adj] is to be reordered as [Adj Noun]. With trimming,
>>>> this can't be done, as we've discarded the analysis of the Adjective,
>>>> treating it as an unknown word.
>>>>
>>>> Now, if we don't discard the analysis and don't trim, we fall back into
>>>> the earlier problem of untranslated lemmas.
>>>>
>>>> This project is a way to have our cake and eat it too. We don't
>>>> discard the analysis even if we don't know the translation, but we don't
>>>> just output the lemma either - we output the source surface form. For a
>>>> solution like this, it is *essential that we propagate the surface
>>>> form at least until transfer, or even until the generator*, so that we
>>>> can use the benefits of the source analysis and then, before producing the
>>>> output, discard it and use the source surface form instead.
>>>>
>>>> Currently the source surface form is discarded at the tagger. This is
>>>> where the stream modification comes in. It's a robust way to propagate the
>>>> surface form through the stream with the least disruption to the current
>>>> modules.
>>>>
>>>> Then there are other possible benefits of secondary information, such
>>>> as markup tags. Hope this makes sense.
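The "have our cake and eat it too" fallback described above can be sketched as follows. The `<sf:...>` secondary tag carrying the surface form, and the dictionary stand-in for bidix lookup, are assumptions for illustration; the `*` and `@` markers are the usual Apertium stream markers for unknown and untranslated words:

```python
import re

def generate(lu, bidix):
    """Sketch of the proposed fallback at output time: the source surface
    form travels in a hypothetical <sf:...> secondary tag, so when the
    bidix has no translation we can emit it instead of the bare lemma."""
    lemma = re.match(r'[^<]*', lu).group()
    sf = re.search(r'<sf:([^>]+)>', lu)
    if lemma in bidix:
        return bidix[lemma]          # normal translation path
    if sf:
        return '*' + sf.group(1)     # fall back to the source surface form
    return '@' + lemma               # old untrimmed behaviour: bare lemma

out = generate('hargle<adj><sf:hargles>', {})
```

Crucially, the full analysis `hargle<adj>` stays in the stream all the way here, so lexical selection and transfer got to use it, and only at this final step is it swapped for the surface form.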
>>>>
>>>> Tanmai
>>>>
>>>> On Tue, Apr 21, 2020 at 12:02 AM Francis Tyers <fty...@prompsit.com> wrote:
>>>>
>>>>> On 2020-04-20 19:21, Daniel Swanson wrote:
>>>>> >> Another way of putting this is that it looks like a technical
>>>>> >> solution in search of a problem, rather than a problem description
>>>>> >> in search of a solution.
>>>>> >
>>>>> > To me the most obvious thing to do with it is to put markup
>>>>> > information in secondary tags as a way of solving the superblank
>>>>> > reordering problem.
>>>>>
>>>>> Didn't we have a solution for this that was worked on over a couple
>>>>> of GSoC projects?
>>>>>
>>>>> Fran
>>>>>
>>>>> _______________________________________________
>>>>> Apertium-stuff mailing list
>>>>> Apertium-stuff@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>>>>
>>>> --
>>>> *Khanna, Tanmai*

--
*Khanna, Tanmai*