I would like everyone to read and seriously consider this thread and give your opinion. This meanders a bit, so please read it all.
Khannatanmai's GSoC project, which initially had the headline of eliminating trimming, has led to what I feel is a fundamental schism in thinking about what Apertium's evolution should be. This actually goes back ~7 years to when we split monolingual from bilingual packages.

References:
- (2013) https://tinodidriksen.com/pisg/freenode/logs/%23apertium/2013-04-12.log from [12:14:08] onwards
- https://www.mail-archive.com/apertium-stuff@lists.sourceforge.net/msg08383.html
- https://www.mail-archive.com/apertium-stuff@lists.sourceforge.net/msg08401.html
- https://wiki.apertium.org/wiki/User:Khannatanmai/Secondary_info_apertium_stream_format
- https://wiki.apertium.org/wiki/User:Khannatanmai/Alternate_stream_modification
- https://wiki.apertium.org/wiki/Secondary_tags

In order to even make trimming optional, we need to carry along the surface form. That means we need to carry along 1 piece of optional data that has little to no linguistic use in the pipe.

As a professional programmer, I took this opportunity to push for the ability to carry ANY optional data, because if we're going to fiddle with all modules in the pipe, we might as well do it so that any future use will be covered. The same mechanism can immediately be used to carry e.g. markup tags, casing information, and input token IDs for input-output alignment. But also things that we haven't thought of yet.

And that is where the first pushback comes, early on. Spectie wants a linguistic reason for EVERY allowed piece of information in the pipe, and I absolutely don't. I want an implementation where some prefixes have known behavior, but unknown prefixes are allowed - so that we do not have to make any adjustments to the pipe if someone wants to try out passing a new secondary piece. A more flexible system now is less work in the future, and lets linguists try out new things without needing programmer help.

So that's the first issue of contention.
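To make the "known prefixes have behavior, unknown prefixes pass through" idea concrete, here is a minimal Python sketch. It is not code from any Apertium module. It assumes secondary data is written as `<prefix:value>` tags on a reading, as in the inline proposal on the linked wiki pages. The names `split_tags`, `interpret` and `KNOWN_PREFIXES`, and the choice of which prefixes count as "known", are hypothetical.

```python
import re
from typing import Callable, Dict, List, Tuple

# Matches any <...> tag on a reading.
TAG = re.compile(r"<([^<>]+)>")

def split_tags(reading: str) -> Tuple[str, List[str], Dict[str, str]]:
    """Split one reading, e.g. 'de<pr><sf:del><id:11>', into
    (lemma, primary tags, secondary prefix -> value)."""
    lemma = reading.split("<", 1)[0]
    primary: List[str] = []
    secondary: Dict[str, str] = {}
    for tag in TAG.findall(reading):
        prefix, sep, value = tag.partition(":")
        if sep and prefix and value:
            secondary[prefix] = value   # <prefix:value> is secondary data
        else:
            primary.append(tag)         # ordinary tag such as <pr>
    return lemma, primary, secondary

# Hypothetical registry: only these prefixes get special treatment.
KNOWN_PREFIXES: Dict[str, Callable[[str], object]] = {
    "W": float,   # assumed here to be a weight
    "id": int,    # assumed here to be a token ID
}

def interpret(secondary: Dict[str, str]) -> Dict[str, object]:
    """Apply known handlers; carry unknown prefixes along untouched."""
    out: Dict[str, object] = {}
    for prefix, value in secondary.items():
        handler = KNOWN_PREFIXES.get(prefix)
        out[prefix] = handler(value) if handler else value
    return out
```

In this sketch, a new prefix that nobody has registered simply flows through as a string, so trying out a new secondary piece needs no change to the registry.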
The second issue is the format that this secondary information should take. In the references above, we tried to get you all to actually consider the pros and cons of the formats, but didn't really get any strong opinions. Which in itself is fine - you mostly don't care how it looks, so long as it works and doesn't get in your way.

Initially, we pushed for a format with inline secondary tags, à la

  ^de<pr><sf:del><id:11><W:1.6787>/of<pr><sf:del><id:11><W:5.0984>/from<pr><sf:del><id:11><W:0.0065>$

Spectie is opposed to this because it puts information on each reading that only pertains to the whole token - and it does. Each reading will carry data that has nothing to do with that reading, and sometimes not even with that token.

Given this objection, Khannatanmai investigated alternative ways to have this data in the stream, which has coalesced into something like this:

  ^de<pr>/of<pr><!12>/from<pr><!13>$[[sf:del; id:5][ri:12,13; t:a:cecs]]

That is, use a word-bound blank with a 1:1 mapping of readings to data. This can work, but it's an order of magnitude harder to implement. It also has a problem with IDs: if a module changes the number of readings or tokens, it must come up with a new globally unique ID.

This can be solved in various ways. One way, with stepped IDs (100, 200, 300), requires all modules to look ahead 1 token to know what range they can assign new IDs from - that is a difficult modification of all tools. Another way is to let each module form its own IDs, which means we'll end up with 4+ different IDs in the primary stream anyway, and all of them are needed to figure out which data belongs to which reading. Both options are again another order of magnitude harder to implement, and both make it harder for humans to read the stream and reason about what belongs where.

I am not a linguist. I am a professional programmer. As a programmer, I care how it looks and works, because I care about how it's going to be implemented now and in the future.
I care about how hard it will be for others (including non-programmers!) to implement tools that work with it. That's why I am against using special symbols and why I am in favour of inline data.

Yes, inline secondary data is linguistically impure. I recognize this. I still think it's worth it, and that it is the best way to do it. And I think that Apertium should allow this impurity, because it's easier for everyone else to work with. If it's a matter of reading the stream, we can trivially make visualizers and cleaners - even perl -pe 's/<[^<>:]+:[^<>]+>//g' will remove all the secondaries.

Inline secondary tags and markup block handling will not hurt the linguistic parts of the pipeline - they're just not pure. And that should be ok. This is one place where the vastly simpler implementation should win over linguistic purity.

But that's where the fundamental schism part comes back, because this goes deeper than just this one GSoC project.

The journey Apertium began by splitting monolinguals from bilinguals has led to, and should further lead to, a broader spectrum of uses, but this hasn't been consciously voiced by anyone: Apertium is now more than a pet project for machine translation. We have a wider ecosystem we should strive to work with.

We already have spell checkers in monolinguals - those have nothing to do with machine translation. It is a natural evolution that monolingual packages should be able to stand on their own and provide corpus analysis, computer assisted language learning (CALL), spell checkers, proofing tools, etc. Many of those uses will require easy-to-use non-linguistic secondary tags in some form. And even with machine translation, many uses will need secondary tags in some form or another.

And I am not just talking without a basis here. I have implemented this kind of stuff in GrammarSoft's pipelines. I have practical experience with what the surrounding ecosystems want.
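For anyone who would rather not use Perl, the cleaner one-liner quoted in this mail can be sketched in Python as well. This is only an illustration of the same regular expression. The function name `strip_secondary` is made up here, and the sample line is the inline example from this mail.

```python
import re

# Same pattern as the Perl one-liner: a tag whose content has the form
# prefix:value, i.e. at least one character on each side of a colon.
SECONDARY_TAG = re.compile(r"<[^<>:]+:[^<>]+>")

def strip_secondary(stream: str) -> str:
    """Remove every inline <prefix:value> tag, leaving primary tags intact."""
    return SECONDARY_TAG.sub("", stream)

if __name__ == "__main__":
    line = ("^de<pr><sf:del><id:11><W:1.6787>"
            "/of<pr><sf:del><id:11><W:5.0984>"
            "/from<pr><sf:del><id:11><W:0.0065>$")
    print(strip_secondary(line))  # prints ^de<pr>/of<pr>/from<pr>$
```

Primary tags such as <pr> contain no colon, so the pattern leaves them alone. That is the whole point of marking secondaries with a textual prefix.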
We should make this easy and flexible now - not hacks upon hacks that need adjusting every year.

I am not trying to usurp the linguistic basis. Naturally, Apertium should be developed linguistics-first - which I also made rather clear in the last PMC election. But I want to do away with the linguistics-only mindset.

Practically, right now I want a mandate from the community and PMC to let Khannatanmai continue with inline secondary tags with short textual prefixes, as originally envisioned and discussed in prior emails.

But I also want to open the discussion about what we actually want from Apertium, because there's clearly a difference of opinion that needs hashing out.

-- Tino Didriksen
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff