[Apertium-stuff] Apertium's Wider Use & Secondary Tags

Tino Didriksen Sat, 13 Jun 2020 07:21:33 -0700

I would like everyone to read and seriously consider this thread and give
your opinion. This meanders a bit, so please read it all.

Khannatanmai's GSoC project, which initially had the headline of eliminating
trimming, has led to what I feel is a fundamental schism in thinking about
how Apertium should evolve. This actually goes back ~7 years to
when we split monolingual from bilingual packages.

References:
- (2013) https://tinodidriksen.com/pisg/freenode/logs/%23apertium/2013-04-12.log
  from [12:14:08] onwards
- https://www.mail-archive.com/apertium-stuff@lists.sourceforge.net/msg08383.html
- https://www.mail-archive.com/apertium-stuff@lists.sourceforge.net/msg08401.html
- https://wiki.apertium.org/wiki/User:Khannatanmai/Secondary_info_apertium_stream_format
- https://wiki.apertium.org/wiki/User:Khannatanmai/Alternate_stream_modification
- https://wiki.apertium.org/wiki/Secondary_tags

In order to even make trimming optional, we need to carry along the surface
form. So that means we need to carry along 1 piece of optional data that
has little to no linguistic use in the pipe. As a professional programmer,
I took this opportunity to push for the ability to carry ANY optional data,
because if we're going to fiddle with all modules in the pipe, we might as
well do it so that any future use will be covered. The same mechanism can
immediately be used to carry e.g. markup tags, casing information, and
input token IDs for input-output alignment. But also things that we haven't
thought of yet.

And this early on is where the first pushback comes. Spectie wants a
linguistic reason for EVERY allowed piece of information in the pipe, and I
absolutely don't. I want an implementation where some prefixes have known
behavior, but unknown prefixes are allowed - so that we do not have to make
any adjustments to the pipe if someone wants to try out passing a new
secondary piece. A more flexible system now is less work in the future, and
lets linguists try out new things without needing programmer help. So
that's the first issue of contention.
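
To make the "known prefixes have behavior, unknown prefixes pass through" idea
concrete, here is a minimal sketch in Python. It is not an existing Apertium
API: the registry KNOWN_PREFIXES, the function process_secondary, and the two
handlers (including the 4-decimal weight formatting) are hypothetical names
and behaviors chosen only for illustration.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a prefix to a handler for its value.
# Neither the prefixes' semantics nor these handlers come from Apertium itself.
KNOWN_PREFIXES: Dict[str, Callable[[str], str]] = {
    "sf": lambda value: value,                  # surface form: carried unchanged
    "W": lambda value: f"{float(value):.4f}",   # weight: rewritten with 4 decimals
}


def process_secondary(tag: str) -> str:
    """Apply a known handler if one exists; pass unknown prefixes through as-is."""
    prefix, sep, value = tag.partition(":")
    if not sep:
        raise ValueError(f"not a secondary tag (no prefix): {tag!r}")
    handler = KNOWN_PREFIXES.get(prefix)
    if handler is None:
        # Unknown prefix: the module does not understand it, so it must not touch it.
        return tag
    return f"{prefix}:{handler(value)}"
```

The point of the design is the early return: a module that has never heard of
a prefix still forwards it untouched, so trying out a new secondary piece
needs no change to the pipe.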

The second issue is the format that this secondary information should take.
In the references above, we tried to get you all to actually consider the
pro/con of formats, but didn't really get any strong opinions. Which in
itself is fine - you mostly don't care how it looks, so long as it works
and doesn't get in your way.

Initially, we pushed for a format with inline secondary tags, à la
^de<pr><sf:del><id:11><W:1.6787>/of<pr><sf:del><id:11><W:5.0984>/from<pr><sf:del><id:11><W:0.0065>$.
Spectie is opposed to this because it puts information on each reading that
only pertains to the whole token - and it does. Each reading will carry
data that has nothing to do with that reading, and sometimes even that
token.
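
As a rough illustration of how little code the inline format needs, the
following Python sketch splits a lexical unit into readings and separates
ordinary tags from prefixed secondary tags. The names Reading and parse_unit
are made up for this example, and the sketch assumes no escaped characters in
the unit, which a real stream parser would have to handle.

```python
import re
from typing import Dict, List, NamedTuple

TAG_RE = re.compile(r"<([^<>]+)>")


class Reading(NamedTuple):
    lemma: str
    tags: List[str]             # ordinary tags, e.g. "pr"
    secondary: Dict[str, str]   # prefixed tags, e.g. {"sf": "del"}


def parse_unit(unit: str) -> List[Reading]:
    """Parse one ^...$ unit. Assumes no backslash-escaped characters."""
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("expected a ^...$ lexical unit")
    readings = []
    for chunk in unit[1:-1].split("/"):
        lemma = chunk.split("<", 1)[0]
        tags: List[str] = []
        secondary: Dict[str, str] = {}
        for tag in TAG_RE.findall(chunk):
            prefix, sep, value = tag.partition(":")
            if sep:
                secondary[prefix] = value   # tag with a prefix: secondary data
            else:
                tags.append(tag)            # plain tag: linguistic data
        readings.append(Reading(lemma, tags, secondary))
    return readings
```

Run on the example above, every reading comes back with the same sf and id
values and only W differing, which is exactly the duplication Spectie objects
to.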

Given this objection, Khannatanmai investigated alternative ways to have
this data in the stream, which has coalesced into something like this:
^de<pr>/of<pr><!12>/from<pr><!13>$[[sf:del; id:5][ri:12,13; t:a:cecs]]
That is, use a word-bound blank with a 1:1 mapping of readings to data.
This can work, but it's an order of magnitude harder to implement.
It also has a problem with IDs: If a module changes the number of readings
or tokens, it must come up with a new globally unique ID. This can be
solved in various ways.

One way, with stepped IDs (100, 200, 300), requires every module to look
ahead 1 token to know what range it can assign new IDs from - that is a
difficult modification of all tools. Another way is to let each module form
its own IDs, which means we'll end up with 4+ different IDs in the primary
stream anyway, and all of them are needed to figure out which data belongs
to a given reading. Both options are again another order of magnitude harder
to implement, and both make it harder for humans to read the stream and
reason about what belongs where.
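
The look-ahead requirement of the stepped-ID approach can be seen in a small
Python sketch. The exact allocation scheme was never pinned down in the
discussion, so the function allocate_ids and its policy (hand out consecutive
IDs strictly between the current token's ID and the next token's ID) are
assumptions made purely for illustration.

```python
from typing import List


def allocate_ids(current_id: int, next_id: int, count: int) -> List[int]:
    """Return `count` fresh IDs strictly between current_id and next_id.

    next_id is the ID of the FOLLOWING token, so the caller must already
    have read one token ahead before it can allocate anything.
    """
    free = next_id - current_id - 1
    if count > free:
        raise ValueError(
            f"need {count} IDs but only {free} free between "
            f"{current_id} and {next_id}"
        )
    return [current_id + offset for offset in range(1, count + 1)]
```

For instance, allocate_ids(100, 200, 2) yields [101, 102]. Without knowing
that the next token is 200, a module cannot tell how far it may go, and once
the gap is used up the scheme fails outright.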

I am not a linguist. I am a professional programmer. As a programmer, I
care how it looks and works, because I care about how it's going to be
implemented now and in the future. I care about how hard it will be for
others (including non-programmers!) to implement tools that work with it. That's
why I am against using special symbols and why I am in favour of inline
data.

Yes, inline secondary data is linguistically impure. I recognize this. I
still think it's worth it, and is the best way to do it. And I think that
Apertium should allow this impurity, because it's easier for everyone else
to work with. If it's a matter of reading the stream, we can trivially make
visualizers and cleaners - even perl -pe 's/<[^<>:]+:[^<>]+>//g' will
remove all the secondaries.
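
For anyone who prefers Python to perl, the same cleaner can be sketched as
below. It uses the same regular expression as the one-liner; only the name
strip_secondary is invented for this example.

```python
import re

# Same pattern as the perl one-liner: a tag whose content contains a colon.
SECONDARY_TAG = re.compile(r"<[^<>:]+:[^<>]+>")


def strip_secondary(stream: str) -> str:
    """Remove every prefixed (secondary) tag, leaving ordinary tags alone."""
    return SECONDARY_TAG.sub("", stream)
```

Applied to the inline example from earlier in this mail, this leaves
^de<pr>/of<pr>/from<pr>$, since <pr> has no colon and is therefore kept.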

Inline secondary tags and markup block handling will not hurt the
linguistic parts of the pipeline - they're just not pure. And that should
be ok. This is one place where the vastly simpler implementation should win
over the linguistic purity.

But that's where the fundamental schism part comes back, because this goes
deeper than just this one GSoC project. The journey Apertium began by
splitting monolinguals from bilinguals has led to and should further lead
to a broader spectrum of uses, but this hasn't been consciously voiced by
anyone: Apertium is now more than a pet project for machine translation.

We have a wider ecosystem we should strive to work with. We already have
spell checkers in monolinguals - those have nothing to do with machine
translation. It is natural evolution that monolingual packages should be
able to stand on their own and provide corpus analysis, computer assisted
language learning (CALL), spell checkers, proofing tools, etc. Many of
those uses will require easy-to-use non-linguistic secondary tags in some
form. And even with machine translation, many uses will need secondary tags
in some form or another.

And I am not just talking without a basis here. I have implemented this
kind of stuff in GrammarSoft's pipelines. I have practical experience with
what the surrounding ecosystems want. We should make this easy and flexible
now - not hacks upon hacks that need adjusting every year.

I am not trying to usurp the linguistic basis. Naturally, Apertium should
be developed linguistics-first - which I also made rather clear in the last
PMC election. But I want to do away with the linguistics-only mindset.

Practically, right now I want a mandate from the community and PMC to
let Khannatanmai continue with inline secondary tags with short textual
prefixes, as originally envisioned and discussed in prior emails.

But I also want to open the discussion about what we actually want from
Apertium, because there's clearly a difference of opinion that needs
hashing out.

-- Tino Didriksen

_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
