On Sat, Jun 13, 2020 at 04:50:48PM +0100, Francis Tyers wrote:
> On 2020-06-13 15:20, Tino Didriksen wrote:
> > I would like everyone to read and seriously consider this thread and
> > give your opinion. This meanders a bit, so please read it all.
> >
>
> Here is a non-exhaustive list of potential pitfalls of using the "surface
> form is a tag" thing. As far as I understand the objective is to be able to
> put the original surface form in the output translation as an unknown token
> instead of the lemma.

Yeah, in practice the purpose is to eventually let any module anywhere use
the surface form for anything. This includes giving the option to print
*surfaceform or @lemma without hacking the dictionaries.

> 0) languages without spaces in the writing system:
>
> what is a surface form here? is it just the longest token matched?

I have always, perhaps naively, thought that when we talk about surface
forms we mean the span of text in the original source input that the
analysis concerns. There shouldn't be complications to this, because the
source is always simply there. Compare the FORM (surface form) field of
CoNLL-U and its validation.

> 1) compounds
>
> i) infrastruktuurontwikkelingsplan, does each part of the compound get
> the surface form tag? if so, what happens if one part of the compound
> is translated but the other parts aren't, e.g. would you get
> *infrastruktuurontwikkelingsplan *infrastruktuurontwikkelingsplan plan?

Everything stored in the stream will let the linguist choose whichever
output is good, once the information is there. Before that, there will be no
regressions in the streams, and that is verified by comprehensive testing.

> 2) contractions
>
> i) chawe - if you attach the surface form to both and both are unknown, do
> you get both in the output? if you only attach it to one, which one do
> you attach it to, where is that decision made?
>
> ii) dárselo - if you attach the surface form to the clitic pronouns in
> addition to the verb, what happens if the verb is not in the dictionary
> but the clitic pronouns are? do you get the surface form and the
> translations in the output?

I guess I'm starting to see where you predict the problems will be: with the
already somewhat dodgy multi-token word features (subwords?) between the
Apertium and CG streams?
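
To make the compound case above concrete, here is a minimal sketch, assuming
the surface form is stored as a span of offsets into the original source text
and that every part of a compound carries the same span. The names Reading,
surface_of and render are hypothetical and only for illustration; this is not
the Apertium stream format or any existing Apertium API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Reading:
    """One analysed part of a token (hypothetical structure)."""
    lemma: str
    span: Tuple[int, int]              # offsets into the original source text
    translation: Optional[str] = None  # None means the part is unknown


def surface_of(source: str, reading: Reading) -> str:
    """Recover the surface form by slicing the source with the span."""
    start, end = reading.span
    return source[start:end]


def render(source: str, readings: List[Reading], dedupe: bool = True) -> str:
    """Print translations where known, otherwise *surfaceform.

    With dedupe=True, an unknown surface form is printed only once per span,
    even if several parts of the same compound are unknown.
    """
    out = []
    seen = set()
    for reading in readings:
        if reading.translation is not None:
            out.append(reading.translation)
            continue
        if dedupe and reading.span in seen:
            continue
        seen.add(reading.span)
        out.append("*" + surface_of(source, reading))
    return " ".join(out)


source = "infrastruktuurontwikkelingsplan"
whole = (0, len(source))  # assumption: all parts share the whole compound's span
parts = [
    Reading("infrastruktuur", whole),
    Reading("ontwikkeling", whole),
    Reading("plan", whole, translation="plan"),
]

print(render(source, parts, dedupe=False))
print(render(source, parts))
```

With dedupe=False the sketch reproduces the doubled output from the quoted
question; with the default it prints the surface form once, followed by the
translated part. Which of these is preferable would remain a decision for the
linguist.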

As for the question of what happens, I'd want the answer to be that after the
implementation we will by default have the same output as before, plus enough
information in the streams for the linguist to make informed decisions on
what to output if they want something nicer. Even with Finnish enclitic
particles, the answer depends on the particle. If there is a limitation in
the current stream format ideas that prevents this, we should probably make a
test case example of it. I feel like we can output many good versions with
the current idea, but I haven't played it through on paper.

-- 
Regards, Flammie <https://flammie.github.io>
(Please note, that I will often include my replies inline instead of top or
bottom of the mail)
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff