Re: [Apertium-stuff] Apertium's Wider Use & Secondary Tags

Francis Tyers Sun, 14 Jun 2020 03:50:14 -0700

On 2020-06-13 15:20, Tino Didriksen wrote:

I would like everyone to read and seriously consider this thread and
give your opinion. This meanders a bit, so please read it all.

Khannatanmai's GSoC project that initially had the headline of
eliminating trimming has led to what I feel is a fundamental schism in
thoughts about what Apertium's evolution should be. This actually goes
back ~7 years to when we split monolingual from bilingual packages.

References:
- (2013) https://tinodidriksen.com/pisg/freenode/logs/%23apertium/2013-04-12.log
  from [12:14:08] onwards
- https://www.mail-archive.com/apertium-stuff@lists.sourceforge.net/msg08383.html
- https://www.mail-archive.com/apertium-stuff@lists.sourceforge.net/msg08401.html
- https://wiki.apertium.org/wiki/User:Khannatanmai/Secondary_info_apertium_stream_format
- https://wiki.apertium.org/wiki/User:Khannatanmai/Alternate_stream_modification
- https://wiki.apertium.org/wiki/Secondary_tags

In order to even make trimming optional, we need to carry along the
surface form. So that means we need to carry along 1 piece of optional
data that has little to no linguistic use in the pipe. As a
professional programmer, I took this opportunity to push for the
ability to carry ANY optional data, because if we're going to fiddle
with all modules in the pipe, we might as well do it so that any
future use will be covered. The same mechanism can immediately be used
to carry e.g. markup tags, casing information, and input token IDs for
input-output alignment. But also things that we haven't thought of
yet.

And this early on is where the first pushback comes. Spectie wants a
linguistic reason for EVERY allowed piece of information in the pipe,
and I absolutely don't. I want an implementation where some prefixes
have known behavior, but unknown prefixes are allowed - so that we do
not have to make any adjustments to the pipe if someone wants to try
out passing a new secondary piece. A more flexible system now is less
work in the future, and lets linguists try out new things without
needing programmer help. So that's the first issue of contention.
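
To make "known prefixes have known behavior, unknown prefixes are
allowed" concrete, here is a minimal Python sketch. It is only an
illustration under assumptions: the names apply_handlers and
round_weight are hypothetical, and no Apertium module is claimed to
work this way. A module acts on the prefixes it knows and passes every
other prefix through unchanged.

```python
def round_weight(value):
    # Hypothetical handler: normalise a weight to two decimals.
    return format(float(value), ".2f")


# Assumption: this module only knows what to do with the "W" prefix.
KNOWN_HANDLERS = {"W": round_weight}


def apply_handlers(tags, handlers=KNOWN_HANDLERS):
    """Apply handlers to known prefixes; keep unknown ones untouched.

    tags is a list of (prefix, value) pairs in stream order.
    """
    result = []
    for prefix, value in tags:
        handler = handlers.get(prefix)
        if handler is not None:
            value = handler(value)
        # Unknown prefixes fall through here without any change.
        result.append((prefix, value))
    return result
```

A new prefix such as "xyz" then needs no change to the module: it is
carried along as-is until some later tool chooses to interpret it.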

The second issue is the format that this secondary information should
take. In the references above, we tried to get you all to actually
consider the pros and cons of the formats, but didn't really get any strong
opinions. Which in itself is fine - you mostly don't care how it
looks, so long as it works and doesn't get in your way.

Initially, we pushed for a format with inline secondary tags, à la
^de<pr><sf:del><id:11><W:1.6787>/of<pr><sf:del><id:11><W:5.0984>/from<pr><sf:del><id:11><W:0.0065>$.
Spectie is opposed to this because it puts information on each reading
that only pertains to the whole token - and it does. Each reading will
carry data that has nothing to do with that reading, and sometimes
even that token.
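
As a rough sketch of how little code the inline format needs, the
Python below splits one unit of that shape into its slash-separated
parts and separates primary tags from secondary ones. It assumes that
any tag containing a colon is secondary, and it ignores escaped
characters; parse_lexical_unit is a hypothetical name, not an existing
Apertium function.

```python
import re

# Matches the content of every <...> tag.
TAG_RE = re.compile(r"<([^<>]+)>")


def parse_lexical_unit(unit):
    """Parse '^a<t><k:v>/b<t><k:v>$' into a list of dicts."""
    body = unit.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("not a lexical unit: %r" % unit)
    parts = []
    # Simplification: escaped '/' characters are not handled.
    for raw in body[1:-1].split("/"):
        lemma = raw.split("<", 1)[0]
        primary, secondary = [], {}
        for tag in TAG_RE.findall(raw):
            prefix, sep, value = tag.partition(":")
            if sep:
                # Assumption: a colon marks a secondary tag.
                secondary[prefix] = value
            else:
                primary.append(tag)
        parts.append({"lemma": lemma, "tags": primary,
                      "secondary": secondary})
    return parts
```

Run on the example above, this yields three parts, each with the
primary tag pr and its own copy of sf, id and W, which is exactly the
duplication the objection is about.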

Given this objection, Khannatanmai investigated alternative ways to
have this data in the stream, which has coalesced to something like
this: ^de<pr>/of<pr><!12>/from<pr><!13>$[[sf:del; id:5][ri:12,13;
t:a:cecs]] - that is, use a word-bound blank with a 1:1 mapping of
readings to data. This can work, but it's an order of magnitude harder
to implement. It also has a problem with IDs: If a module changes the
number of readings or tokens, it must come up with a new globally
unique ID. This can be solved in various ways.

One way, with stepped IDs (100, 200, 300), requires all modules to
look ahead 1 token to know what range they can assign new IDs from -
that is a difficult modification of all tools. Another way is to let
each module form its own IDs, which means we'll end up with 4+
different IDs in the primary stream anyway, and all of them are needed
to figure out which data belongs to that reading. Both options are
again another order of magnitude harder to implement, and both options
are harder for humans to read the stream and reason about what belongs
where.
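
To show why the stepped-ID variant needs look-ahead, here is a small
Python sketch. It assumes new IDs must fall strictly between the
current token's ID and the next token's ID; allocate_ids is a
hypothetical helper, and the real scheme was never specified in this
much detail.

```python
def allocate_ids(current_id, next_id, count):
    """Return `count` fresh IDs strictly between two token IDs.

    next_id is only known after reading one token ahead, which is
    the look-ahead every module would have to implement.
    """
    free = next_id - current_id - 1
    if count > free:
        raise ValueError(
            "need %d IDs but only %d are free between %d and %d"
            % (count, free, current_id, next_id))
    return [current_id + i for i in range(1, count + 1)]
```

With a step of 100, a module sitting on token 100 can add at most 99
new readings or tokens before the range runs out, and it cannot know
that limit without first seeing that the next token is 200.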

I am not a linguist. I am a professional programmer. As a programmer,
I care how it looks and works, because I care about how it's going to
be implemented now and in the future. I care about how hard it will be
for others (including non-programmers!) to implement tools to work
with. That's why I am against using special symbols and why I am in
favour of inline data.

Yes, inline secondary data is linguistically impure. I recognize this.
I still think it's worth it, and is the best way to do it. And I think
that Apertium should allow this impurity, because it's easier for
everyone else to work with. If it's a matter of reading the stream, we
can trivially make visualizers and cleaners - even perl -pe
's/<[^<>:]+:[^<>]+>//g' will remove all the secondaries.
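
The same cleaner in Python, for anyone who would rather not use perl.
It uses the identical regular expression; strip_secondary is just an
illustrative name.

```python
import re

# Same pattern as the perl one-liner: a tag whose content has a
# non-empty prefix, a colon, and a non-empty value.
SECONDARY_RE = re.compile(r"<[^<>:]+:[^<>]+>")


def strip_secondary(stream):
    """Remove every inline secondary tag, leaving primary tags alone."""
    return SECONDARY_RE.sub("", stream)
```

Tags without a colon, such as <pr>, do not match the pattern and are
left in place.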

Inline secondary tags and markup block handling will not hurt the
linguistic parts of the pipeline - they're just not pure. And that
should be ok. This is one place where the vastly simpler
implementation should win over the linguistic purity.

But that's where the fundamental schism part comes back, because this
goes deeper than just this one GSoC project. The journey Apertium
began by splitting monolinguals from bilinguals has led to and should
further lead to a broader spectrum of uses, but this hasn't been
consciously voiced by anyone: Apertium is now more than a pet project
for machine translation.

We have a wider ecosystem we should strive to work with. We already
have spell checkers in monolinguals - those have nothing to do with
machine translation. It is natural evolution that monolingual packages
should be able to stand on their own and provide corpus analysis,
computer assisted language learning (CALL), spell checkers, proofing
tools, etc. Many of those uses will require easy-to-use non-linguistic
secondary tags in some form. And even with machine translation, many
uses will need secondary tags in some form or another.

And I am not just talking without a basis here. I have implemented
this kind of stuff in GrammarSoft's pipelines. I have practical
experience with what the surrounding ecosystems want. We should make
this easy and flexible now - not hacks upon hacks that need adjusting
every year.

I am not trying to usurp the linguistic basis. Naturally, Apertium
should be developed linguistics-first - which I also made rather clear
in the last PMC election. But I want to do away with the
linguistics-only mindset.

Practically, right now I want a mandate from the community and PMC to
let Khannatanmai continue with inline secondary tags with short
textual prefixes, as originally envisioned and discussed in prior
emails.

But I also want to open the discussion about what we actually want
from Apertium, because there's clearly a difference of opinion that
needs hashing out.


In some ways I think that Tino is right here. I think that there is
a problem here in terms of vision. Having Apertium pulled and designed
in two incompatible ways is just going to end up wasting effort
and community goodwill.

Perhaps the best thing to do is for me to resign as president. We can
have another election and people can vote for the vision and design
that they prefer.

Best regards,

Fran


_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
