Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

Tino Didriksen Sun, 29 Mar 2020 04:32:00 -0700

It's all transparent. Nobody has to add secondary information to the
stream. All current pipes will continue to work as-is, unmodified. All old
data and files remain valid.

The work is to allow for arbitrary secondary information to be added to the
stream. Initially for use with surface forms, so that we can eliminate the
need for trimming and get input surface forms carried through to output.

Pieces of information that would be useful to have in the stream include,
but are not limited to: surface form, syntactic function, sentiment,
semantics (noun, adjective, verb), roles, verb frames, dependency, markup
tags.

As an example, http://codepad.org/Lq4xwyZr is how VISL + GramTrans' stream
looks, after syntactic transfer. You've got the input tokens, re-arranged,
with target analysis injected, but keeping original analysis and
information about where this token was in the input. This is all needed.
Some of this comes all the way from the input analysis, and is useful in
several places along the way.

It cannot be passed along via a separate channel (DAG-style) - it is
inherently tied to each reading. What you could do, and what GramTrans
does, is to store a lot of information in a separate channel, but still
store handles to this information per-reading. E.g., that's how GramTrans
handles markup tags - separate channel stores the full tag <a
href="bla"></a> and stream passes along <t:a:5> - the stream still gets
baseline information that there is an <a> tag, because that's relevant, but
doesn't need all the attributes.
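
A minimal sketch of that handle idea, not GramTrans' actual code: the
class name MarkupStore, its methods, and the use of a list index as the
numeric part of the handle are all assumptions made for illustration.
Only the <t:a:5> handle shape is taken from the description above.

```python
class MarkupStore:
    """Hypothetical side channel: keeps full markup tags, hands out short handles."""

    def __init__(self):
        self._tags = []

    def add(self, name, full_tag):
        # Store the full tag out of band and return the handle that
        # travels in the stream, e.g. <t:a:0> for the first stored tag.
        self._tags.append(full_tag)
        return f"<t:{name}:{len(self._tags) - 1}>"

    def resolve(self, handle):
        # "<t:a:0>" -> kind "t", tag name "a", index "0"
        kind, name, index = handle.strip("<>").split(":")
        if kind != "t":
            raise ValueError(f"not a markup handle: {handle}")
        return self._tags[int(index)]


store = MarkupStore()
handle = store.add("a", '<a href="bla">')
print(handle)                 # the part that stays attached to the reading
print(store.resolve(handle))  # the full tag, recovered from the side channel
```

The stream still sees that there is an <a> tag, while the attributes stay
in the store until something needs them.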

Appending to each reading also allows pausing the pipe at any point while
still retaining all the information.

Will Apertium need all that information? Not immediately. But currently
it's impossible to append secondary information to readings, and it's
hindering us. What we do need immediately is surface form and markup tags.
And if we're going to modify the stream to transport this, we might as well
allow anything.

The proposal boils down to: tags with : in them are secondary, and
secondary tags are always trailing. Make secondary tags not break the pipe.
Modify the tokeniser to append surface form via <s:...> and generator to
use surface form if there is no translation.
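
A rough sketch of how a module might apply that rule, assuming a reading
looks like lemma<tag><tag>...; the function names split_reading and
surface_or_lemma are hypothetical, and escaping, multiwords and joined
readings are deliberately ignored here.

```python
import re

TAG_RE = re.compile(r"<([^<>]+)>")


def split_reading(reading):
    """Split 'lemma<tag>...' into (lemma, primary tags, secondary tags)."""
    first = reading.find("<")
    if first == -1:
        return reading, [], {}
    lemma = reading[:first]
    primary, secondary = [], {}
    for tag in TAG_RE.findall(reading[first:]):
        if ":" in tag:
            # Secondary tag: everything before the first ':' is the key.
            key, _, value = tag.partition(":")
            secondary[key] = value
        elif secondary:
            # A primary tag after a secondary one breaks the "always trailing" rule.
            raise ValueError(f"primary tag <{tag}> after secondary tags")
        else:
            primary.append(tag)
    return lemma, primary, secondary


def surface_or_lemma(reading):
    """Fallback for an untranslated reading: prefer the carried surface form."""
    lemma, _, secondary = split_reading(reading)
    return secondary.get("s", lemma)


print(split_reading("house<n><sg><s:House>"))
print(surface_or_lemma("house<n><sg><s:House>"))
```

A module that only cares about primary tags can drop the third element
and behave exactly as it does today.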

-- Tino Didriksen

On Sun, 29 Mar 2020 at 12:21, Mikel L. Forcada <m...@dlsi.ua.es> wrote:

> Folks:
>
> The elders in Apertium will not be surprised if I voice my opposition to
> changing the formats used between the different modules of the Apertium
> pipeline. In any case, this affects the core functionality of
> Apertium in many ways and its need should be justified in an uncontestable
> way so that the PMC makes a decision to have a new version of Apertium
> which should inevitably have paths to backward compatibility so that legacy
> languages and language pairs work identically and without any loss of
> performance. I believe we are far from "uncontestability", but that is just
> my personal opinion.
>
> Currently, modes are linear pipelines. Any functionality requiring
> information that is currently ingested by one module and not passed ahead
> could be sent to later modules by teeing (
> https://en.wikipedia.org/wiki/Tee_(command)), named pipes, etc. We would
> have a directed acyclic graph, much as in tools like make, snakemake, dgsh (
> https://github.com/dspinellis/dgsh/wiki) etc.
>
> Any justification should first prove that the functionality is robust and
> needed by working around the current format and modules, and be presented
> in a level of formality which is comparable to that used currently in our
> documentation.
>
> Having said that, no one can oppose people forking and testing. If the
> new thing works, Apertium could bless the fork and merge it (depending on
> how the fork handles provisions for legacy Apertium workflows). But, as I
> said, this seems premature to me. But I am usually very conservative.
>
> Cheers
>
> Mikel
>

