On Sat, Jun 13, 2020 at 04:50:48PM +0100, Francis Tyers wrote:
> On 2020-06-13 15:20, Tino Didriksen wrote:
> > I would like everyone to read and seriously consider this thread and
> > give your opinion. This meanders a bit, so please read it all.
> > 
> 
> Here is a non-exhaustive list of potential pitfalls of using the "surface
> form is a tag" thing. As far as I understand the objective is to be able to
> put the original surface form in the output translation as an unknown token
> instead of the lemma.

Yeah, in practice the purpose is to eventually let any module anywhere
use the surface form for anything. This includes giving the option to
print *surfaceform or @lemma without hacking the dictionaries.
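
To make the *surfaceform vs. @lemma choice concrete, here is a minimal
sketch in Python. It is not Apertium code: the Unit class, the render
function and the prefer_surface switch are all made-up names for
illustration, and it only assumes that an untranslated token still
carries both its surface form and its lemma.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Unit:
    """Hypothetical lexical unit; not an Apertium data structure."""
    surface: str                # text as it appeared in the source
    lemma: str                  # lemma from the analyser
    translation: Optional[str]  # None when no translation was found


def render(unit: Unit, prefer_surface: bool = True) -> str:
    """Print the translation, or fall back to *surface or @lemma."""
    if unit.translation is not None:
        return unit.translation
    if prefer_surface:
        return "*" + unit.surface
    return "@" + unit.lemma


# Illustrative token: Finnish "talossani" with lemma "talo".
unknown = Unit(surface="talossani", lemma="talo", translation=None)
print(render(unknown, prefer_surface=True))
print(render(unknown, prefer_surface=False))
```

The point is only that the choice becomes a switch at output time
rather than something encoded in the dictionaries.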

> 0) languages without spaces in the writing system:
> 
>    what is a surface form here? is it just the longest token matched?

I always very naively thought that when we talk about surface forms we
mean the span of text in the original source input that the analysis
concerns. There shouldn't be complications to this, because the source
is always simply there. Compare to the FORM field of CoNLL-U and its
validation.
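
A small Python sketch of the "span of the source" reading, under my own
assumptions: the Analysis class, the surface helper and the tags are
hypothetical, and nothing here is an existing stream format. Each
analysis stores offsets into the source, so the surface form is
recovered by slicing and no whitespace is needed to define it.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Analysis:
    """Hypothetical analysis that points back into the source text."""
    lemma: str
    tags: Tuple[str, ...]
    start: int  # offset into the source, inclusive
    end: int    # offset into the source, exclusive


def surface(source: str, analysis: Analysis) -> str:
    """The surface form is whatever the source holds at that span."""
    return source[analysis.start:analysis.end]


source = "infrastruktuurontwikkelingsplan"
span = (0, len(source))

# Three compound parts that all concern the same span of the source.
# Lemmas and tags are illustrative only.
parts = [
    Analysis("infrastruktuur", ("n",), *span),
    Analysis("ontwikkeling", ("n",), *span),
    Analysis("plan", ("n",), *span),
]

for part in parts:
    print(part.lemma, "->", surface(source, part))
```

With this reading every part of a compound reports the same surface
form, which is exactly the situation point 1) below asks about.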

> 1) compounds
> 
> i)  infrastruktuurontwikkelingsplan, does each part of the compound get
>     the surface form tag? if so, what happens if one part of the compound
>     is translated but the other parts aren't, e.g. would you get
>     *infrastruktuurontwikkelingsplan *infrastruktuurontwikkelingsplan plan?

All the stuff stored in the stream will let the linguist choose
whichever is good, once the things are there. Before that there will be
no regressions in the streams, and that is verified by comprehensive
testing.
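
As one example of a choice a linguist could make, here is a Python
sketch of a policy that prints the surface form only once per source
span, so the compound case does not repeat it. This is just one possible
policy, not the proposal itself; Part and render_parts are hypothetical
names and the span is assumed to be stored with each part.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Part:
    """Hypothetical compound part carrying its source span."""
    span: Tuple[int, int]       # offsets of the whole source token
    surface: str                # surface form of the whole token
    translation: Optional[str]  # None when the part is untranslated


def render_parts(parts: List[Part]) -> str:
    """Emit translations; emit *surface once per untranslated span."""
    out = []
    emitted = set()
    for part in parts:
        if part.translation is not None:
            out.append(part.translation)
        elif part.span not in emitted:
            out.append("*" + part.surface)
            emitted.add(part.span)
    return " ".join(out)


word = "infrastruktuurontwikkelingsplan"
span = (0, len(word))
parts = [
    Part(span, word, None),    # infrastruktuur: untranslated
    Part(span, word, None),    # ontwikkeling: untranslated
    Part(span, word, "plan"),  # plan: translated
]
print(render_parts(parts))  # *infrastruktuurontwikkelingsplan plan
```

Other policies (print the whole surface form and drop the partial
translation, for instance) are equally easy once the span is in the
stream.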

> 2) contractions
> 
> i)  chawe - if you attach the surface form to both and both are unknown, do
>     you get both in the output? if you only attach it to one, which one do
>     you attach it to, where is that decision made?
> 
> ii) dárselo - if you attach the surface form to the clitic pronouns in
>     addition to the verb, what happens if the verb is not in the dictionary
>     but the clitic pronouns are? do you get the surface form and the
>     translations in the output?

I guess I'm starting to see where you predict the problems will be: with
the already somewhat dodgy multi-token word features (subwords?) between
the apertium and cg streams?

As for the question of what happens, I'd want the answer to be that
after the implementation we will by default have the same output as
before, and enough information in the streams for the linguist to make
informed decisions on what to output if they want to output something
nicer. I mean, even with Finnish enclitic particles the answer depends
on the particle.

If there is a limitation in the current stream format ideas that
prevents this, we should probably make a test case example of it. I feel
like we can output many good versions with the current idea, but I
haven't played it through on paper.


-- 
Regards, Flammie <https://flammie.github.io>
(Please note, that I will often include my replies inline instead of
top or bottom of the mail)
