I'm not familiar enough with the pipeline to completely understand this, but arbitrary inline secondary tags with short textual prefixes do seem the most logical.
On Sat, Jun 13, 2020 at 7:21 AM Tino Didriksen <m...@tinodidriksen.com> wrote:

> I would like everyone to read and seriously consider this thread and give your opinion. This meanders a bit, so please read it all.
>
> Khannatanmai's GSoC project that initially had the headline of eliminating trimming has led to what I feel is a fundamental schism in thoughts about what Apertium's evolution should be. This actually goes back ~7 years to when we split monolingual from bilingual packages.
>
> References:
> - (2013) https://tinodidriksen.com/pisg/freenode/logs/%23apertium/2013-04-12.log from [12:14:08] onwards
> - https://www.mail-archive.com/apertium-stuff@lists.sourceforge.net/msg08383.html
> - https://www.mail-archive.com/apertium-stuff@lists.sourceforge.net/msg08401.html
> - https://wiki.apertium.org/wiki/User:Khannatanmai/Secondary_info_apertium_stream_format
> - https://wiki.apertium.org/wiki/User:Khannatanmai/Alternate_stream_modification
> - https://wiki.apertium.org/wiki/Secondary_tags
>
> In order to even make trimming optional, we need to carry along the surface form. So that means we need to carry along 1 piece of optional data that has little to no linguistic use in the pipe. As a professional programmer, I took this opportunity to push for the ability to carry ANY optional data, because if we're going to fiddle with all modules in the pipe, we might as well do it so that any future use will be covered. The same mechanism can immediately be used to carry e.g. markup tags, casing information, and input token IDs for input-output alignment. But also things that we haven't thought of yet.
>
> And that early is where the first pushback comes. Spectie wants a linguistic reason for EVERY allowed piece of information in the pipe, and I absolutely don't. I want an implementation where some prefixes have known behavior, but unknown prefixes are allowed - so that we do not have to make any adjustments to the pipe if someone wants to try out passing a new secondary piece. A more flexible system now is less work in the future, and lets linguists try out new things without needing programmer help. So that's the first issue of contention.
>
> The second issue is the format that this secondary information should take. In the references above, we tried to get you all to actually consider the pro/con of formats, but didn't really get any strong opinions. Which in itself is fine - you mostly don't care how it looks, so long as it works and doesn't get in your way.
>
> Initially, we pushed for a format with inline secondary tags, a'la
> ^de<pr><sf:del><id:11><W:1.6787>/of<pr><sf:del><id:11><W:5.0984>/from<pr><sf:del><id:11><W:0.0065>$.
> Spectie is opposed to this because it puts information on each reading that only pertains to the whole token - and it does. Each reading will carry data that has nothing to do with that reading, and sometimes even that token.
>
> Given this objection, Khannatanmai investigated alternative ways to have this data in the stream, which has coalesced to something like this:
> ^de<pr>/of<pr><!12>/from<pr><!13>$[[sf:del; id:5][ri:12,13; t:a:cecs]]
> - that is, use a word-bound blank with a 1:1 mapping of readings to data. This can work, but it's an order of magnitude harder to implement. It also has a problem with IDs: If a module changes the number of readings or tokens, it must come up with a new globally unique ID. This can be solved in various ways.
>
> One way with stepped IDs (100, 200, 300) requires all modules to look-ahead 1 token to know what range it can assign new IDs from - that is a difficult modification of all tools. Another way is to let each module form their own IDs, which means we'll end up with 4+ different IDs in the primary stream anyway, and all of them are needed to figure out which data belongs to that reading. Both options are again another order of magnitude harder to implement, and both options are harder for humans to read the stream and reason about what belongs where.
>
> I am not a linguist. I am a professional programmer. As a programmer, I care how it looks and works, because I care about how it's going to be implemented now and in the future. I care about how hard it will be for others (including non-programmers!) to implement tools to work with. That's why I am against using special symbols and why I am in favour of inline data.
>
> Yes, inline secondary data is linguistically impure. I recognize this. I still think it's worth it, and is the best way to do it. And I think that Apertium should allow this impurity, because it's easier for everyone else to work with. If it's a matter of reading the stream, we can trivially make visualizers and cleaners - even perl -pe 's/<[^<>:]+:[^<>]+>//g' will remove all the secondaries.
>
> Inline secondary tags and markup block handling will not hurt the linguistic parts of the pipeline - they're just not pure. And that should be ok. This is one place where the vastly simpler implementation should win over the linguistic purity.
>
> But that's where the fundamental schism part comes back, because this goes deeper than just this one GSoC project. The journey Apertium began by splitting monolinguals from bilinguals has led to and should further lead to a broader spectrum of uses, but this hasn't been consciously voiced by anyone: Apertium is now more than a pet project for machine translation.
>
> We have a wider ecosystem we should strive to work with. We already have spell checkers in monolinguals - those have nothing to do with machine translation. It is natural evolution that monolingual packages should be able to stand on their own and provide corpus analysis, computer assisted language learning (CALL), spell checkers, proofing tools, etc. Many of those uses will require easy-to-use non-linguistic secondary tags in some form. And even with machine translation, many uses will need secondary tags in some form or another.
>
> And I am not just talking without a basis here. I have implemented this kind of stuff in GrammarSoft's pipelines. I have practical experience with what the surrounding ecosystems want. We should make this easy and flexible now - not hacks upon hacks that need adjusting every year.
>
> I am not trying to usurp the linguistic basis. Naturally, Apertium should be developed linguistics-first - which I also made rather clear in the last PMC election. But I want to do away with the linguistics-only mindset.
>
> Practically, right now I want a mandate from the community and PMC to let Khannatanmai continue with inline secondary tags with short textual prefixes, as originally envisioned and discussed in prior emails.
>
> But I also want to open the discussion about what we actually want from Apertium, because there's clearly a difference of opinion that needs hashing out.
>
> -- Tino Didriksen
>
> _______________________________________________
> Apertium-stuff mailing list
> Apertium-stuff@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
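
To make the inline proposal concrete, here is a rough Python sketch of the two operations the quoted mail relies on: stripping secondaries (same regex as the perl one-liner) and collecting them by prefix without a whitelist, so unknown prefixes pass through. The function names are made up for illustration and are not part of any Apertium tool, and the naive split on "/" assumes no escaped slashes in the readings.

```python
import re

# Same pattern as the perl one-liner in the quoted mail: a tag whose
# content is "prefix:value". Ordinary tags such as <pr> have no colon
# and are left alone.
SECONDARY = re.compile(r"<([^<>:]+):([^<>]+)>")


def strip_secondary(stream: str) -> str:
    """Remove every inline secondary tag from a stream fragment."""
    return SECONDARY.sub("", stream)


def split_reading(reading: str):
    """Return (reading without secondaries, {prefix: value}).

    Hypothetical helper: no prefix is special-cased, so a prefix nobody
    has defined yet is collected just like sf, id or W.
    """
    secondary = dict(SECONDARY.findall(reading))
    return SECONDARY.sub("", reading), secondary


# Example lexical unit taken from the quoted mail.
unit = ("^de<pr><sf:del><id:11><W:1.6787>"
        "/of<pr><sf:del><id:11><W:5.0984>"
        "/from<pr><sf:del><id:11><W:0.0065>$")

print(strip_secondary(unit))  # ^de<pr>/of<pr>/from<pr>$

# Naive split: assumes the unit is wrapped in ^...$ and that readings
# contain no escaped "/" characters.
for reading in unit[1:-1].split("/"):
    print(split_reading(reading))
```

A module that only cares about the linguistic tags could work on the first element of the pair and re-attach the dictionary afterwards; whether that is how the real modules should do it is exactly what is under discussion.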