On 24 December 2013 16:34, Mikel Forcada <m...@dlsi.ua.es> wrote:

> Since transfer rules (.t1x, .t2x) have to move superblanks around
> explicitly, it may be the case that valid HTML or XML is rendered
> invalid. For instance, a translated ODT file may not open, or a
> translated XHTML page may not be valid.
>
> For instance a rule can move around <b pos="1"/> and <b pos="2"/>. If <b
> pos="1"/> is "<sometag>" and <b pos="2"/> is  "</sometag>", the result
> is that </sometag> comes before <sometag>, leading to invalid XML or HTML.
>
> Similar validity errors may be introduced when tags are lost or repeated.
>
> I see no easy way to solve this without a serious redesign of blank
> management (perhaps by keeping a standoff list of blanks outside the
> stream). But I think it's good to be aware of it.


Here's a solution, as also discussed with Francis back in 2011 (
http://alpha.visl.sdu.dk/~tino/pisg/freenode/logs/%23apertium_20110331.log@
[08:45:03] ), which involves storing the tags on each token.

Given input string "<p><b><i>My sister lives</i> <u>in Wales</u></b></p>"
you turn that into

My <b><i>
sister <b><i>
lives <b><i>
in <b><u>
Wales <b><u>


Note that no closing tags are stored, and that only inline tags are stored.
A tag is closed as soon as it is no longer present on the next token. If you
only let the system see, and thus move around, inline tags, you eliminate a
huge swath of problems. It is rare that people use inline tags for anything
too special.
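
To make the storage step concrete, here is a minimal Python sketch of it. It
is not the code we actually run: the INLINE set, the TokenTagger class and
the tag_tokens function are names made up for this illustration, tokens are
split on whitespace only, and attributes are dropped for brevity (a real
implementation has to keep them).

```python
from html.parser import HTMLParser

# Assumption for this sketch: which tags count as inline.
INLINE = {"b", "i", "u", "em", "strong", "span", "a"}


class TokenTagger(HTMLParser):
    """Attach the currently open inline tags to every word."""

    def __init__(self):
        super().__init__()
        self.open_inline = []  # inline tags open at the current position
        self.tokens = []       # list of (word, tuple of inline tags)

    def handle_starttag(self, tag, attrs):
        # Block-level tags such as <p> are never stored on tokens.
        if tag in INLINE:
            self.open_inline.append(tag)

    def handle_endtag(self, tag):
        # Closing tags are not stored; they only update the open list.
        if tag in INLINE and tag in self.open_inline:
            # Remove the innermost open occurrence of this tag.
            idx = len(self.open_inline) - 1 - self.open_inline[::-1].index(tag)
            del self.open_inline[idx]

    def handle_data(self, data):
        for word in data.split():
            self.tokens.append((word, tuple(self.open_inline)))


def tag_tokens(markup):
    parser = TokenTagger()
    parser.feed(markup)
    parser.close()
    return parser.tokens


for word, tags in tag_tokens(
    "<p><b><i>My sister lives</i> <u>in Wales</u></b></p>"
):
    print(word, "".join(f"<{t}>" for t in tags))
```

Run on the example string, this prints the same five word/tag lines as
listed above.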

Anyway, when transformed to "My sister in Wales lives" you have

My <b><i>
sister <b><i>
in <b><u>
Wales <b><u>
lives <b><i>

which will reformat to "<p><b><i>My sister</i> <u>in Wales</u>
<i>lives</i></b></p>".

This duplicates the <i> tag, but since it's an inline tag that usually does
not matter. And since the non-inline tag <p> counts as a hard break, it
cannot be moved around in the stream and won't mess up anything.
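
The reformatting step can be sketched in Python as follows. This is only an
illustration under assumptions: render is a hypothetical name, tokens are
(word, tags) pairs without attributes, words are joined with single spaces,
and the enclosing <p> is left out because block-level tags are handled
outside the token stream.

```python
def render(tokens):
    """Rebuild inline markup from (word, tags) pairs."""
    out = []
    open_tags = []  # inline tags currently open in the output
    for n, (word, tags) in enumerate(tokens):
        # Count how many open tags this token shares, outermost first.
        keep = 0
        while (keep < len(open_tags) and keep < len(tags)
               and open_tags[keep] == tags[keep]):
            keep += 1
        # Close everything that is no longer present on this token.
        while len(open_tags) > keep:
            out.append(f"</{open_tags.pop()}>")
        if n > 0:
            out.append(" ")
        # Open the tags that are new on this token.
        for tag in tags[keep:]:
            out.append(f"<{tag}>")
            open_tags.append(tag)
        out.append(word)
    # Close whatever is still open at the end of the segment.
    while open_tags:
        out.append(f"</{open_tags.pop()}>")
    return "".join(out)


reordered = [
    ("My", ("b", "i")),
    ("sister", ("b", "i")),
    ("in", ("b", "u")),
    ("Wales", ("b", "u")),
    ("lives", ("b", "i")),
]
print(render(reordered))
# <b><i>My sister</i> <u>in Wales</u> <i>lives</i></b>
```

Because tags are always closed innermost first, the output is properly
nested no matter how the tokens were reordered.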

That's the basis of the method we use to handle DOCX, ODT, HTML, MediaWiki
XML, SDL XML, etc, without problems. Naturally you need to determine which
tags are considered inline and how to store the tags and all their
attributes on a per-token basis, but I can say that this method works
really well.

This method also cannot produce formally invalid output, though if people
abuse inline tags for fancy CSS markup then it may look weird (but as said,
that is rare).

And since tags are stored on the tokens, rules hardly need to know about
them. Rules are free to move tokens around and delete them as needed. When a
rule inserts a new token, that token needs to adopt its neighbours' tags.
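
A minimal Python sketch of that insertion rule, with assumptions stated up
front: insert_token is a hypothetical helper, tokens are (word, tags) pairs,
and copying from the left neighbour (falling back to the right one at the
start) is just one possible policy, not necessarily the one we use.

```python
def insert_token(tokens, index, word):
    """Return a new token list with `word` inserted at `index`.

    The new token copies the tags of its left neighbour, or of its right
    neighbour when it is inserted at the very start (assumed policy).
    """
    if not tokens:
        tags = ()
    elif index > 0:
        tags = tokens[index - 1][1]
    else:
        tags = tokens[0][1]
    return tokens[:index] + [(word, tags)] + tokens[index:]


tokens = [("in", ("b", "u")), ("Wales", ("b", "u"))]
print(insert_token(tokens, 1, "North"))
# [('in', ('b', 'u')), ('North', ('b', 'u')), ('Wales', ('b', 'u'))]
```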

-- Tino Didriksen
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
