I'm not familiar enough with the pipeline to completely understand this,
but arbitrary inline secondary tags with short textual prefixes do seem the
most logical choice.

On Sat, Jun 13, 2020 at 7:21 AM Tino Didriksen <m...@tinodidriksen.com>
wrote:

> I would like everyone to read and seriously consider this thread and give
> your opinion. This meanders a bit, so please read it all.
>
> Khannatanmai's GSoC project, which initially had the headline of
> eliminating trimming, has exposed what I feel is a fundamental schism over
> what Apertium's evolution should be. This actually goes back ~7 years to
> when we split monolingual from bilingual packages.
>
> References:
> - (2013) https://tinodidriksen.com/pisg/freenode/logs/%23apertium/2013-04-12.log
>   from [12:14:08] onwards
> - https://www.mail-archive.com/apertium-stuff@lists.sourceforge.net/msg08383.html
> - https://www.mail-archive.com/apertium-stuff@lists.sourceforge.net/msg08401.html
> - https://wiki.apertium.org/wiki/User:Khannatanmai/Secondary_info_apertium_stream_format
> - https://wiki.apertium.org/wiki/User:Khannatanmai/Alternate_stream_modification
> - https://wiki.apertium.org/wiki/Secondary_tags
>
> In order to even make trimming optional, we need to carry along the
> surface form. So that means we need to carry along 1 piece of optional data
> that has little to no linguistic use in the pipe. As a professional
> programmer, I took this opportunity to push for the ability to carry ANY
> optional data, because if we're going to fiddle with all modules in the
> pipe, we might as well do it so that any future use will be covered. The
> same mechanism can immediately be used to carry e.g. markup tags, casing
> information, and input token IDs for input-output alignment. But also
> things that we haven't thought of yet.
>
> And that is where the first pushback comes, early on. Spectie wants a
> linguistic reason for EVERY allowed piece of information in the pipe, and I
> absolutely don't. I want an implementation where some prefixes have known
> behavior, but unknown prefixes are allowed - so that we do not have to make
> any adjustments to the pipe if someone wants to try out passing a new
> secondary piece. A more flexible system now is less work in the future, and
> lets linguists try out new things without needing programmer help. So
> that's the first issue of contention.
>
> The second issue is the format that this secondary information should
> take. In the references above, we tried to get you all to actually consider
> the pros and cons of the formats, but didn't really get any strong
> opinions. Which in itself is fine - you mostly don't care how it looks, so
> long as it works and doesn't get in your way.
>
> Initially, we pushed for a format with inline secondary tags, à la:
> ^de<pr><sf:del><id:11><W:1.6787>/of<pr><sf:del><id:11><W:5.0984>/from<pr><sf:del><id:11><W:0.0065>$
> Spectie is opposed to this because it puts information on each reading that
> only pertains to the whole token - and it does. Each reading will carry
> data that has nothing to do with that reading, and sometimes even that
> token.
>
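To check my own reading of the inline format, here is a minimal Python sketch of how a tool might separate primary tags from secondary ones. It assumes a secondary tag is any <prefix:value> tag, that readings are separated by "/", and that nothing is escaped; parse_reading and parse_unit are hypothetical names, not part of any Apertium tool.

```python
import re

# Assumption: every tag is <...>, and a tag of the form <prefix:value>
# is secondary. Escaping in the stream is not handled in this sketch.
TAG = re.compile(r"<([^<>]+)>")


def parse_reading(reading):
    """Split one reading into (lemma, primary tags, secondary data)."""
    lemma = reading.split("<", 1)[0]
    primary, secondary = [], {}
    for body in TAG.findall(reading):
        prefix, sep, value = body.partition(":")
        if sep and prefix and value:
            # Unknown prefixes are kept as-is rather than rejected.
            secondary[prefix] = value
        else:
            primary.append(body)
    return lemma, primary, secondary


def parse_unit(unit):
    """Parse '^a<x>/b<y>$' into a list of parsed readings."""
    inner = unit.strip().lstrip("^").rstrip("$")
    return [parse_reading(r) for r in inner.split("/")]


if __name__ == "__main__":
    unit = ("^de<pr><sf:del><id:11><W:1.6787>"
            "/of<pr><sf:del><id:11><W:5.0984>"
            "/from<pr><sf:del><id:11><W:0.0065>$")
    for lemma, primary, secondary in parse_unit(unit):
        print(lemma, primary, secondary)
```

Since the prefix is only a dictionary key here, a new prefix would need no change to the parser, which I take to be the flexibility argued for above.
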
> Given this objection, Khannatanmai investigated alternative ways to have
> this data in the stream, which has coalesced into something like this:
> ^de<pr>/of<pr><!12>/from<pr><!13>$[[sf:del; id:5][ri:12,13; t:a:cecs]]
> That is, use a word-bound blank with a 1:1 mapping of readings to data.
> This can work, but it's an order of magnitude harder to implement. It
> also has a problem with IDs: if a module changes the number of readings
> or tokens, it must come up with a new globally unique ID. This can be
> solved in various ways.
>
> One way, with stepped IDs (100, 200, 300), requires every module to look
> ahead 1 token to know what range it can assign new IDs from - that is a
> difficult modification of all tools. Another way is to let each module
> form its own IDs, which means we'll end up with 4+ different IDs in the
> primary stream anyway, and all of them are needed to figure out which data
> belongs to which reading. Both options are again another order of magnitude
> harder to implement, and both make it harder for humans to read the
> stream and reason about what belongs where.
>
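To see why the stepped-ID variant forces look-ahead, here is a small Python sketch under my own assumptions: IDs are integers, and a module may only hand out new IDs strictly between the current token's ID and the next token's ID. The function new_ids is hypothetical and only illustrates the constraint.

```python
def new_ids(current_id, next_id, count):
    """Return `count` fresh IDs strictly between current_id and next_id.

    Assumption: tokens arrive with stepped IDs (100, 200, 300, ...), so a
    module has to know next_id, i.e. read one token ahead, before it can
    tell how many IDs are free.
    """
    free = next_id - current_id - 1
    if count > free:
        raise ValueError("not enough free IDs between tokens")
    return [current_id + i for i in range(1, count + 1)]


if __name__ == "__main__":
    # A module that adds two readings to token 100, with token 200 next.
    print(new_ids(100, 200, 2))
```

Even in this toy form, the function cannot be called until the next token has been read, and a gap can run out once several modules have inserted IDs into it.
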
> I am not a linguist. I am a professional programmer. As a programmer, I
> care how it looks and works, because I care about how it's going to be
> implemented now and in the future. I care about how hard it will be for
> others (including non-programmers!) to implement tools that work with it.
> That's why I am against using special symbols and why I am in favour of
> inline data.
>
> Yes, inline secondary data is linguistically impure. I recognize this. I
> still think it's worth it, and is the best way to do it. And I think that
> Apertium should allow this impurity, because it's easier for everyone else
> to work with. If it's a matter of reading the stream, we can trivially make
> visualizers and cleaners - even perl -pe 's/<[^<>:]+:[^<>]+>//g' will
> remove all the secondaries.
>
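For anyone who would rather not reach for perl, the same cleaner can be sketched in Python. It uses exactly the regular expression from the one-liner above; strip_secondaries is a hypothetical name, and the sketch assumes, like the perl version, that any <prefix:value> tag is secondary.

```python
import re
import sys

# Same pattern as the perl one-liner: a tag whose body contains a colon.
SECONDARY = re.compile(r"<[^<>:]+:[^<>]+>")


def strip_secondaries(stream):
    """Remove every <prefix:value> tag and leave everything else alone."""
    return SECONDARY.sub("", stream)


if __name__ == "__main__":
    # Filter stdin to stdout, line by line, like `perl -pe`.
    for line in sys.stdin:
        sys.stdout.write(strip_secondaries(line))
```

Saved under a name of your choosing, it can sit at the end of a pipe to give a stream without the secondaries for reading.
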
> Inline secondary tags and markup block handling will not hurt the
> linguistic parts of the pipeline - they're just not pure. And that should
> be ok. This is one place where the vastly simpler implementation should win
> over the linguistic purity.
>
> But that's where the fundamental schism part comes back, because this goes
> deeper than just this one GSoC project. The journey Apertium began by
> splitting monolinguals from bilinguals has led to and should further lead
> to a broader spectrum of uses, but this hasn't been consciously voiced by
> anyone: Apertium is now more than a pet project for machine translation.
>
> We have a wider ecosystem we should strive to work with. We already have
> spell checkers in monolinguals - those have nothing to do with machine
> translation. It is natural evolution that monolingual packages should be
> able to stand on their own and provide corpus analysis, computer assisted
> language learning (CALL), spell checkers, proofing tools, etc. Many of
> those uses will require easy-to-use non-linguistic secondary tags in some
> form. And even with machine translation, many uses will need secondary tags
> in some form or another.
>
> And I am not just talking without a basis here. I have implemented this
> kind of stuff in GrammarSoft's pipelines. I have practical experience with
> what the surrounding ecosystems want. We should make this easy and flexible
> now - not hacks upon hacks that need adjusting every year.
>
> I am not trying to usurp the linguistic basis. Naturally, Apertium should
> be developed linguistics-first - which I also made rather clear in the last
> PMC election. But I want to do away with the linguistics-only mindset.
>
> Practically, right now I want a mandate from the community and PMC to
> let Khannatanmai continue with inline secondary tags with short textual
> prefixes, as originally envisioned and discussed in prior emails.
>
> But I also want to open the discussion about what we actually want from
> Apertium, because there's clearly a difference of opinion that needs
> hashing out.
>
> -- Tino Didriksen
>
> _______________________________________________
> Apertium-stuff mailing list
> Apertium-stuff@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>