Re: [Apertium-stuff] Apertium's Wider Use & Secondary Tags

Samuel Sloniker Sat, 13 Jun 2020 07:33:06 -0700

I'm not familiar enough with the pipeline to completely understand this,
but arbitrary inline secondary tags with short textual prefixes do seem the
most logical.


On Sat, Jun 13, 2020 at 7:21 AM Tino Didriksen <m...@tinodidriksen.com>
wrote:

> I would like everyone to read and seriously consider this thread and give
> your opinion. This meanders a bit, so please read it all.
>
> Khannatanmai's GSoC project that initially had the headline of eliminating
> trimming has led to what I feel is a fundamental schism in thoughts about
> what Apertium's evolution should be. This actually goes back ~7 years to
> when we split monolingual from bilingual packages.
>
> References:
> - (2013) https://tinodidriksen.com/pisg/freenode/logs/%23apertium/2013-04-12.log
>   from [12:14:08] onwards
> - https://www.mail-archive.com/apertium-stuff@lists.sourceforge.net/msg08383.html
> - https://www.mail-archive.com/apertium-stuff@lists.sourceforge.net/msg08401.html
> - https://wiki.apertium.org/wiki/User:Khannatanmai/Secondary_info_apertium_stream_format
> - https://wiki.apertium.org/wiki/User:Khannatanmai/Alternate_stream_modification
> - https://wiki.apertium.org/wiki/Secondary_tags
>
> In order to even make trimming optional, we need to carry along the
> surface form. So that means we need to carry along 1 piece of optional data
> that has little to no linguistic use in the pipe. As a professional
> programmer, I took this opportunity to push for the ability to carry ANY
> optional data, because if we're going to fiddle with all modules in the
> pipe, we might as well do it so that any future use will be covered. The
> same mechanism can immediately be used to carry e.g. markup tags, casing
> information, and input token IDs for input-output alignment. But also
> things that we haven't thought of yet.
>
> And that is where the first pushback comes in. Spectie wants a
> linguistic reason for EVERY allowed piece of information in the pipe, and I
> absolutely don't. I want an implementation where some prefixes have known
> behavior, but unknown prefixes are allowed - so that we do not have to make
> any adjustments to the pipe if someone wants to try out passing a new
> secondary piece. A more flexible system now is less work in the future, and
> lets linguists try out new things without needing programmer help. So
> that's the first issue of contention.
>
> The second issue is the format that this secondary information should
> take. In the references above, we tried to get you all to actually consider
> the pro/con of formats, but didn't really get any strong opinions. Which in
> itself is fine - you mostly don't care how it looks, so long as it works
> and doesn't get in your way.
>
> Initially, we pushed for a format with inline secondary tags, à la
> ^de<pr><sf:del><id:11><W:1.6787>/of<pr><sf:del><id:11><W:5.0984>/from<pr><sf:del><id:11><W:0.0065>$.
> Spectie is opposed to this because it puts information on each reading that
> only pertains to the whole token - and it does. Each reading will carry
> data that has nothing to do with that reading, and sometimes even that
> token.
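
For readers less familiar with the stream format, the inline proposal can be
sketched in Python as follows. This is only an illustration, not code from any
Apertium module: the helper names parse_reading and parse_unit are
hypothetical, and the rule that a tag containing a colon is secondary is an
assumption taken from the example above. Escaped characters and slashes
inside tags are not handled.

```python
import re

# Matches one <...> tag and captures its contents.
TAG_RE = re.compile(r"<([^<>]+)>")


def parse_reading(reading):
    """Split one reading into (lemma, primary tags, secondary data).

    Assumption: a tag of the form <prefix:value> is secondary,
    any other tag is a primary (linguistic) tag.
    """
    lemma = reading.split("<", 1)[0]
    primary, secondary = [], {}
    for tag in TAG_RE.findall(reading):
        prefix, sep, value = tag.partition(":")
        if sep:
            secondary[prefix] = value
        else:
            primary.append(tag)
    return lemma, primary, secondary


def parse_unit(unit):
    """Parse ^reading/reading/...$ into a list of parsed readings."""
    body = unit.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("not a lexical unit: %r" % unit)
    # Naive split: does not handle escaped slashes.
    return [parse_reading(r) for r in body[1:-1].split("/")]
```

Every reading comes back with its own copy of sf, id and W, which is exactly
the duplication the objection is about.
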
>
> Given this objection, Khannatanmai investigated alternative ways to have
> this data in the stream, which has coalesced into something like this:
> ^de<pr>/of<pr><!12>/from<pr><!13>$[[sf:del; id:5][ri:12,13; t:a:cecs]]
> That is, use a word-bound blank with a 1:1 mapping of readings to data.
> This can work, but it's an order of magnitude harder to implement.
> It also has a problem with IDs: If a module changes the number of readings
> or tokens, it must come up with a new globally unique ID. This can be
> solved in various ways.
>
> One way, with stepped IDs (100, 200, 300), requires every module to look
> ahead 1 token to know what range it can assign new IDs from - that is a
> difficult modification of all tools. Another way is to let each module
> form its own IDs, which means we'll end up with 4+ different IDs in the
> primary stream anyway, and all of them are needed to figure out which data
> belongs to that reading. Both options are again another order of magnitude
> harder to implement, and both make it harder for humans to read the
> stream and reason about what belongs where.
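
To illustrate why stepped IDs force a look-ahead, here is a minimal sketch in
Python. The function new_ids is hypothetical and not part of any Apertium
tool. It assumes that fresh IDs must fall strictly between the current
token's ID and the next token's ID, and it ignores IDs that an earlier module
may already have handed out in that gap.

```python
def new_ids(current_id, next_id, count):
    """Return `count` fresh IDs strictly between two neighbouring token IDs.

    `next_id` is only known after reading one token ahead, which is the
    look-ahead requirement described above.
    """
    available = next_id - current_id - 1
    if count > available:
        raise ValueError(
            "ID range exhausted between %d and %d" % (current_id, next_id)
        )
    return [current_id + i for i in range(1, count + 1)]
```

For example, new_ids(100, 200, 3) yields 101, 102 and 103, but the module
cannot know that 200 is the upper bound until it has seen the next token.
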
>
> I am not a linguist. I am a professional programmer. As a programmer, I
> care how it looks and works, because I care about how it's going to be
> implemented now and in the future. I care about how hard it will be for
> others (including non-programmers!) to implement tools that work with it. That's
> why I am against using special symbols and why I am in favour of inline
> data.
>
> Yes, inline secondary data is linguistically impure. I recognize this. I
> still think it's worth it, and is the best way to do it. And I think that
> Apertium should allow this impurity, because it's easier for everyone else
> to work with. If it's a matter of reading the stream, we can trivially make
> visualizers and cleaners - even perl -pe 's/<[^<>:]+:[^<>]+>//g' will
> remove all the secondaries.
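
The same cleaner can be written in Python, using the regular expression from
the perl one-liner unchanged. The function name strip_secondary_tags is made
up for illustration, and like the one-liner it assumes that no primary tag
contains a colon.

```python
import re

# Same pattern as the perl one-liner: a tag with a prefix, a colon, and a value.
SECONDARY_TAG = re.compile(r"<[^<>:]+:[^<>]+>")


def strip_secondary_tags(stream):
    """Remove every <prefix:value> tag and leave all other tags untouched."""
    return SECONDARY_TAG.sub("", stream)
```

Applied to ^de<pr><sf:del><id:11><W:1.6787>$ this leaves ^de<pr>$.
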
>
> Inline secondary tags and markup block handling will not hurt the
> linguistic parts of the pipeline - they're just not pure. And that should
> be ok. This is one place where the vastly simpler implementation should win
> over the linguistic purity.
>
> But that's where the fundamental schism part comes back, because this goes
> deeper than just this one GSoC project. The journey Apertium began by
> splitting monolinguals from bilinguals has led to and should further lead
> to a broader spectrum of uses, but this hasn't been consciously voiced by
> anyone: Apertium is now more than a pet project for machine translation.
>
> We have a wider ecosystem we should strive to work with. We already have
> spell checkers in monolinguals - those have nothing to do with machine
> translation. It is natural evolution that monolingual packages should be
> able to stand on their own and provide corpus analysis, computer assisted
> language learning (CALL), spell checkers, proofing tools, etc. Many of
> those uses will require easy-to-use non-linguistic secondary tags in some
> form. And even with machine translation, many uses will need secondary tags
> in some form or another.
>
> And I am not just talking without a basis here. I have implemented this
> kind of stuff in GrammarSoft's pipelines. I have practical experience with
> what the surrounding ecosystems want. We should make this easy and flexible
> now - not hacks upon hacks that need adjusting every year.
>
> I am not trying to usurp the linguistic basis. Naturally, Apertium should
> be developed linguistics-first - which I also made rather clear in the last
> PMC election. But I want to do away with the linguistics-only mindset.
>
> Practically, right now I want a mandate from the community and PMC to
> let Khannatanmai continue with inline secondary tags with short textual
> prefixes, as originally envisioned and discussed in prior emails.
>
> But I also want to open the discussion about what we actually want from
> Apertium, because there's clearly a difference of opinion that needs
> hashing out.
>
> -- Tino Didriksen
>
> _______________________________________________
> Apertium-stuff mailing list
> Apertium-stuff@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>

