It's all transparent. Nobody has to add secondary information to the stream. All current pipes will continue to work as-is, unmodified. All old data and files remain valid.
The work is to allow arbitrary secondary information to be added to the stream. Initially this is for surface forms, so that we can eliminate the need for trimming and get input surface forms carried through to output. Pieces of information that would be useful to have in the stream include, but are not limited to: surface form, syntactic function, sentiment, semantics (noun, adjective, verb), roles, verb frames, dependency, markup tags.

As an example, http://codepad.org/Lq4xwyZr is how VISL + GramTrans' stream looks after syntactic transfer. You've got the input tokens, re-arranged, with target analysis injected, but keeping the original analysis and information about where each token was in the input. This is all needed. Some of it comes all the way from the input analysis, and is useful in several places along the way.

It cannot be passed along via a separate channel (DAG-style) - it is inherently tied to each reading. What you could do, and what GramTrans does, is store a lot of information in a separate channel, but still store handles to this information per reading. E.g., that's how GramTrans handles markup tags - the separate channel stores the full tag <a href="bla"></a> and the stream passes along <t:a:5> - the stream still gets the baseline information that there is an <a> tag, because that's relevant, but doesn't need all the attributes. Appending to each reading also allows pausing the pipe at any location while still retaining all the information.

Will Apertium need all that information? Not immediately. But currently it's impossible to append secondary information to readings, and it's hindering us. What we do need immediately is surface form and markup tags. And if we're going to modify the stream to transport this, we might as well allow anything.

The proposal boils down to: tags with : in them are secondary, and secondary tags are always trailing. Make secondary tags not break the pipe.
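As a rough sketch of that rule, and not how any existing Apertium module does it: a consumer could split a reading's tags into primary and secondary by looking for a colon, and a module that doesn't care about secondary tags could simply drop the trailing ones. The function names split_reading and strip_secondary and the reading house<n><sg><s:Houses><t:a:5> are made up for illustration, and the sketch assumes tags never contain escaped < or > characters.

```python
import re

# Matches one <...> tag and captures its content (assumes no escaped < or >).
TAG_RE = re.compile(r"<([^<>]+)>")


def split_reading(reading):
    """Split 'lemma<t1><t2><s:x>' into (lemma, primary_tags, secondary_tags).

    A tag is treated as secondary if it contains ':'. Secondary tags must be
    trailing, so a primary tag after a secondary one is rejected.
    """
    lemma = reading.split("<", 1)[0]
    primary, secondary = [], []
    for tag in TAG_RE.findall(reading):
        if ":" in tag:
            secondary.append(tag)
        elif secondary:
            raise ValueError(f"primary tag <{tag}> follows a secondary tag")
        else:
            primary.append(tag)
    return lemma, primary, secondary


def strip_secondary(reading):
    """Return the reading with all secondary tags removed."""
    lemma, primary, _ = split_reading(reading)
    return lemma + "".join(f"<{t}>" for t in primary)


if __name__ == "__main__":
    example = "house<n><sg><s:Houses><t:a:5>"
    print(split_reading(example))
    print(strip_secondary(example))
```

A reading without any secondary tags passes through strip_secondary unchanged, which is the sense in which old data would remain valid under this sketch.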
Modify the tokeniser to append surface form via <s:...> and generator to use surface form if there is no translation.

-- Tino Didriksen

On Sun, 29 Mar 2020 at 12:21, Mikel L. Forcada <m...@dlsi.ua.es> wrote:

> Folks:
>
> The elders in Apertium will not be surprised if I voiced my opposition to
> changing the format in the Apertium formats used between different modules
> of the pipeline. In any case, this is affects the core functionality of
> Apertium in many ways and its need should be justified in an uncontestable
> way so that the PMC makes a decision to have a new version of Apertium
> which should inevitably have paths to backward compatibility so that legacy
> languages and language pairs work identically and without any loss of
> performance. I believe we are far from "uncontestability", but that is just
> my personal opinion.
>
> Currently, modes are linear pipelines. Any functionality requiring
> information that is currently ingested by one module and not passed ahead
> could be sent to later modules by teeing (
> https://en.wikipedia.org/wiki/Tee_(command)), named pipes, etc. We would
> have a directed acyclic graph, much as in tools like make, snakemake, dgsh (
> https://github.com/dspinellis/dgsh/wiki) etc.
>
> Any justification should first prove that the functionality is robust and
> needed by working around the current format and modules, and be presented
> in a level of formality which is comparable to that used currently in our
> documentation.
>
> Having said that, no one cannot oppose people forking and testing. If the
> new thing works, Apertium could bless the fork and merge it (depending on
> how the fork handles provisions for legacy Apertium workflows). But, as I
> said, this seems premature to me. But I am usually very conservative.
>
> Cheers
>
> Mikel
>
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff