Re: [Apertium-stuff] Apertium's Wider Use & Secondary Tags

Tanmai Khanna Sat, 13 Jun 2020 11:09:06 -0700

I'll try to clarify a few things here. These issues have been discussed at
length but for everyone's benefit, here is part of the discussion:

> 0) languages without spaces in the writing system:

>    what is a surface form here? is it just the longest token matched?

The surface form here is what shows up as *xyz when a source word is
unknown. I'm not entirely sure what we currently do for unknown words in
orthographies without spaces, but whatever that is, it stays the same here.

> 1) compounds

> i)  infrastruktuurontwikkelingsplan, does each part of the compound get
>     the surface form tag? if so, what happens if one part of the compound
>     is translated but the other parts aren't, e.g. would you get
>     *infrastruktuurontwikkelingsplan *infrastruktuurontwikkelingsplan
>     plan?

This particular example has been discussed in:
https://wiki.apertium.org/wiki/User:Khannatanmai/Secondary_info_apertium_stream_format#Compounds_.26_Surface_Form
It's also mentioned later that when compound parts break up they will each
share a compound id, which can be used to bring them together later on and
output only one unknown surface form. If we don't want to risk breaking up
a compound when part of it may not be translated (because some part can
move around in transfer), we can use weights in the monolingual dictionary
derived from the bidix (something that has also been discussed) to emulate
trimming just for compounds. That amounts to deciding that compounds should
either translate fully or not at all (something that can be challenged, of
course).
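
To make the compound id idea concrete, here is a minimal sketch in Python.
It is not the Apertium implementation; the Unit class and its field names
are hypothetical, and I'm assuming an untranslated part simply has no
target text. Parts that share a compound id are collapsed into one unknown
surface form if any of them failed to translate:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Unit:
    """Hypothetical lexical unit after transfer (illustration only)."""
    target: Optional[str]      # translated text, or None if unknown
    sf: str                    # source surface form carried as secondary info
    cmp: Optional[int] = None  # shared compound id, None outside compounds


def render(units: List[Unit]) -> List[str]:
    """Emit one *surface form per compound that has an untranslated part."""
    # Compound ids for which at least one part is untranslated.
    failed = {u.cmp for u in units if u.cmp is not None and u.target is None}
    emitted = set()
    out = []
    for u in units:
        if u.cmp in failed:
            # All-or-nothing: output the whole compound once, as unknown.
            if u.cmp not in emitted:
                out.append("*" + u.sf)
                emitted.add(u.cmp)
        elif u.target is None:
            out.append("*" + u.sf)
        else:
            out.append(u.target)
    return out
```

In this sketch the unknown form is emitted at the position of the first
part of the compound that is encountered; where it should go when parts
move around in transfer is exactly the kind of decision still open.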

> 2) contractions

> i)  chawe - if you attach the surface form to both and both are unknown,
>     do you get both in the output? if you only attach it to one, which one
>     do you attach it to, where is that decision made?

After postchunk, the parts are merged again using their shared compound id,
so only one unknown surface form shows up. If we don't want to do this, we
can do the same thing as for compounds and emulate trimming.

> I think that the appropriate way to deal with this is by coming up with a
> clear plan for the linguistic eventualities. I don't see that in the
> current proposal.

In terms of compounds and how to deal with them, the entire discussion
about weighting monolingual dictionaries using the bidix came about largely
because of compounds, so that we can emulate trimming for these cases. This
was mentioned in the first proposal as well:
https://wiki.apertium.org/wiki/User:Khannatanmai/GSoC2020Proposal_Trimming#Compounds_and_multiwords
In fact, there we discussed that if -separable deals with multiwords, the
current trimming algorithm harms it, because it trims away XY if both X and
Y aren't in the bidix, even though the bidix might have XY as one unit.
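
The XY problem can be shown with a toy check. This is only a sketch under
my own assumptions: the bidix is modelled as a plain set of lemmas, a
multiword entry is stored as its parts joined by a space, and both function
names are made up for illustration:

```python
from typing import List, Set


def keep_per_part(parts: List[str], bidix: Set[str]) -> bool:
    """Trimming as described above: keep XY only if every part is in the bidix."""
    return all(p in bidix for p in parts)


def keep_unit_aware(parts: List[str], bidix: Set[str]) -> bool:
    """Also keep XY when the bidix has the whole multiword as one unit."""
    return " ".join(parts) in bidix or keep_per_part(parts, bidix)
```

With bidix = {"give up"}, keep_per_part(["give", "up"], bidix) is False
while keep_unit_aware(["give", "up"], bidix) is True, which is the case
where per-part trimming throws away an analysis the bidix could translate.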

infrastruktuurontwikkelingsplan ->
^infrastruktuurontwikkelingsplan/infrastruktuur<n><cmp>+ontwikkeling<n><cmp>+plan<n><sg>$

-> ^infrastruktuur<n><!12>$[{sf:infrastruktuurontwikkelingsplan; cmp:1}]
^ontwikkelings<n><!13>$[{sf:infrastruktuurontwikkelingsplan; cmp:1}]
^plan<n><!14>$[{sf:infrastruktuurontwikkelingsplan; cmp:1}]

> chawe - ^chawe/chi<pr>+<px2sg>wech<n><rel>$

-> ^chi<pr><!1>$[[sf:chawe; cmp:1]]
^<px2sg>wech<n><rel><!2>$[[sf:chawe; cmp:1]]

(This is the current wordbound blanks solution. This can also be put in
secondary tags in a similar manner.)
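
For anyone who wants to play with the chawe example, here is a small
Python sketch that reads this notation. It only handles the illustrative
form used in this mail (a unit followed by a [[key:value; ...]] blank, with
the unit id written as <!N>); it is not a parser for the real Apertium
stream format, and parse_units is a made-up name:

```python
import re

# A unit ^...$ immediately followed by a [[...]] blank, as in the example.
UNIT = re.compile(r"\^([^$]*)\$\[\[([^\]]*)\]\]")
# The per-unit id, written as <!N> in the example.
UNIT_ID = re.compile(r"<!(\d+)>")


def parse_units(stream):
    """Return a list of dicts with the reading, unit id and blank fields."""
    units = []
    for body, blank in UNIT.findall(stream):
        match = UNIT_ID.search(body)
        info = {}
        for field in blank.split(";"):
            if field.strip():
                key, _, value = field.partition(":")
                info[key.strip()] = value.strip()
        units.append({
            "reading": UNIT_ID.sub("", body),  # reading without the id tag
            "id": int(match.group(1)) if match else None,
            "info": info,
        })
    return units
```

Both units parsed from the chawe line carry sf "chawe" and cmp "1", which
is the information a later stage would need to merge them again.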

> *Is there* an answer yet to what we *want* to happen in

> > such as what should happen to secondary information when tokens are
> > merged/split.

> ?

> Not about the algorithm or implementation or anything. Just, what
> would we like the result to be?

Yes. When a token is split, its secondary information is duplicated on
each part, and each part gets an id that records that it was once part of a
larger unit. When tokens are merged, all of their secondary information is
added to the merged unit, marked with reading ids. See the separable output
in:
https://wiki.apertium.org/wiki/User:Khannatanmai/Secondary_info_apertium_stream_format#Another_one
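
As a sketch of the intended result (not of any implementation), the split
and merge behaviour could look like this in Python. Units are plain dicts,
and split_unit, merge_units and the "cmp" key are hypothetical names chosen
to match the examples in this mail:

```python
from itertools import count
from typing import Iterator, List


def split_unit(unit: dict, readings: List[str],
               ids: Iterator[int], group_id: int) -> List[dict]:
    """Duplicate the secondary info on every part and tag the shared origin."""
    return [
        {"reading": r, "id": next(ids), "info": {**unit["info"], "cmp": group_id}}
        for r in readings
    ]


def merge_units(units: List[dict], reading: str, ids: Iterator[int]) -> dict:
    """Keep all secondary info on the merged unit, keyed by the old unit ids."""
    return {
        "reading": reading,
        "id": next(ids),
        "info": {u["id"]: dict(u["info"]) for u in units},
    }
```

For example, with ids = count(1), splitting a unit whose info is
{"sf": "chawe"} into two readings gives two parts that both carry
{"sf": "chawe", "cmp": 1}, and merging them again keeps both copies under
their respective ids.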

I'm not saying all linguistic eventualities are hashed out, but they're
being actively worked on. If there is one that I'm not dealing with here,
I'm happy to discuss it.

Lastly, IDs on units help with better dependency markings, alignments, and
having information in wordbound blanks without filling up lexical units.
While the wordbound blanks solution is more complex and harder to
implement, it does seem to largely address everyone's concerns about the
original proposal.

None of what I'm saying here is new, but somehow this discussion is
stagnating more than most projects I've seen in action. If people are still
not convinced about any benefits that can come out of such a modification,
then I guess I'll choose the best solution that provides no regression,
keep it optional, and if it really is helpful then people can use it. :)

Thanks and Regards,
Tanmai Khanna

On Sat, Jun 13, 2020 at 11:13 PM Jonathan Washington <
jonathan.n.washing...@gmail.com> wrote:

> On Sat, 13 Jun 2020 at 13:15, <nlhow...@gmail.com> wrote:
> >
> > Tino Didriksen wrote:
> > > On Sat, 13 Jun 2020 at 17:50, Francis Tyers <fty...@prompsit.com> wrote:
> > >
> > > > As far as I understand the objective is to be able to put the
> > > > original surface form in the output translation as an unknown
> > > > token instead of the lemma.
> > > >
> > > > ...
> > > >
> > > > I think that the appropriate way to deal with this is by coming up
> > > > with a clear plan for the linguistic eventualities. I don't see
> > > > that in the current proposal. I have been showing Tanmai through
> > > > the creation of a new MT system, and we have been documenting
> > > > these issues as they arise. I don't think it makes sense to start
> > > > development before they have been resolved.
> > > >
> > >
> > >
> > > Those are important issues, but they're orthogonal to how to transport
> > > secondary information through the pipe.
> >
> > > Even at the earliest stages of the proposal, it was expanded to be
> > > 1) Get secondary tags through the pipe. 2) Use that ability to
> > > eliminate trimming. 3) Use the same ability for a myriad of other
> > > things, such as markup handling.
> >
> > If I understand the issue correctly, it isn't clear yet that (2) as
> > phrased is possible.
> >
> > *Is there* an answer yet to what we *want* to happen in
> >
> > > such as what should happen to secondary information when tokens are
> > > merged/split.
> >
> > ?
>
> I agree.  A proposed solution to the issues Fran raises needs to be
> part of the proposal for transport format.  The issues are too closely
> intertwined.
>
> > Not about the algorithm or implementation or anything. Just, what
> > would we like the result to be?
> >
> > > We need to implement and solve #1 first - be able to transport (and
> > > potentially manipulate) any amount of data that might be needed to
> > > solve #2 and #3 and ... #9.
> >
> > I don't think it makes sense to mandate a mechanism we aren't
> > convinced will work...
>
> This has never been suggested as a mandate.  Whatever the approach to
> the issues at hand, the proposal is for an extra feature that
> translation pair developers may decide to use or ignore as they see
> fit.
>
> I want to also highlight Tino's point about urgency.  This is part of
> an active GSoC project, and that project needs to move forward.  That
> doesn't mean that this discussion shouldn't be allowed to take its
> time, but we really do need to find a path forward.
>
> --
> Jonathan
>
>
> > Cheers,
> > Nick
>

-- 
*Khanna, Tanmai*

_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
