I'll just share our obviously superior silver-bullet approach again. The
earlier discussion is in
http://alpha.visl.sdu.dk/~tino/pisg/freenode/logs/%23apertium_20110331.log
(starts at [08:45:03]).

The core idea is to turn all inline HTML/XML elements (A, B, I, U, SPAN,
etc) into tags on the words they cover, so that the words can be freely
moved around. The post-processing then runs through the words and spits out
an opening tag when it first sees an HTML element and a closing tag when it
no longer sees that HTML element.

Meaning, the intermediate internal format does NOT know where tags open or
close; that is up to the post-processor to determine. This is the only way
to correctly split or merge HTML elements that I can think of.
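
To make the post-processing step concrete, here is a minimal sketch in
Python. It is not the actual implementation; the function name recombine,
the (text, tags) token layout, and the PUNCTUATION set are assumptions made
for illustration, and element attributes are ignored entirely.

```python
# Hypothetical sketch: each token is (text, tags), where tags is a tuple of
# the inline element names covering that word, outermost first.
PUNCTUATION = {".", ",", ";", ":", "!", "?"}  # assumed: no space before these


def recombine(tokens):
    """Rebuild inline markup from words that only carry element tags."""
    out = []
    open_tags = []  # stack of elements currently open in the output
    for index, (text, tags) in enumerate(tokens):
        # Count how many open elements this word still belongs to.
        keep = 0
        while (keep < len(open_tags) and keep < len(tags)
               and open_tags[keep] == tags[keep]):
            keep += 1
        # Close every element we no longer see, innermost first.
        while len(open_tags) > keep:
            out.append("</%s>" % open_tags.pop())
        # Put the word separator outside the elements that were just closed.
        if index > 0 and text not in PUNCTUATION:
            out.append(" ")
        # Open every element we see for the first time.
        for tag in tags[keep:]:
            out.append("<%s>" % tag)
            open_tags.append(tag)
        out.append(text)
    # Close whatever is still open at the end of the sentence.
    while open_tags:
        out.append("</%s>" % open_tags.pop())
    return "".join(out)
```

Because the open and close positions are derived only from the tags on
neighbouring words, the same function both merges adjacent words that share
an element and splits an element whose words have been moved apart.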

Block level HTML elements (DIV, P, TD, etc) are protected and count as
sentence barriers.
Special elements (SCRIPT, STYLE, etc) are entirely protected: their
contents are never translated or modified in any way.
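
The three classes of elements could be written down roughly like this in
Python. The sets only contain the elements named above (the real lists are
longer), and the function name classify, the returned labels, and the
"unknown" fallback are assumptions made for illustration.

```python
# Hypothetical sketch: only the elements named in the text are listed.
INLINE = frozenset({"a", "b", "i", "u", "span"})
BLOCK = frozenset({"div", "p", "td"})
PROTECTED = frozenset({"script", "style"})


def classify(element_name):
    """Return how an element is treated before translation."""
    name = element_name.lower()
    if name in PROTECTED:
        return "protected"         # contents passed through untouched
    if name in BLOCK:
        return "sentence-barrier"  # kept in place, ends the sentence
    if name in INLINE:
        return "word-tag"          # turned into tags on the covered words
    return "unknown"               # assumed fallback, not from the text
```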

Example input:
<p><b>My sister lives</b> <i>in Wales</i>.</p>
...would essentially get broken down to
<p>
My <b>
sister <b>
lives <b>
in <i>
Wales <i>
.
</p>
...re-ordered to "My sister in Wales lives."
<p>
My <b>
sister <b>
in <i>
Wales <i>
lives <b>
.
</p>
...which would be recombined to
<p><b>My sister</b> <i>in Wales</i> <b>lives</b>.</p>
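
The breakdown step of this example can be sketched with Python's standard
html.parser module. This is only an illustration, not our actual code: the
class name Breakdown, the helper break_down, the INLINE set, and the simple
word/punctuation tokenisation are all assumptions, and block-level elements
such as <p> are simply skipped here.

```python
import re
from html.parser import HTMLParser

INLINE = {"a", "b", "i", "u", "span"}  # assumed list of inline elements


class Breakdown(HTMLParser):
    """Turn inline markup into (word, tags) pairs."""

    def __init__(self):
        super().__init__()
        self.stack = []   # inline elements currently open, outermost first
        self.tokens = []  # collected (word, tags) pairs

    def handle_starttag(self, tag, attrs):
        if tag in INLINE:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Only well-nested input is handled in this sketch.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Words and single punctuation marks become separate tokens.
        for word in re.findall(r"\w+|[^\w\s]", data):
            self.tokens.append((word, tuple(self.stack)))


def break_down(html):
    parser = Breakdown()
    parser.feed(html)
    parser.close()
    return parser.tokens


tokens = break_down("<p><b>My sister lives</b> <i>in Wales</i>.</p>")
# Stand-in for the translator's re-ordering: move "lives" after "Wales".
reordered = tokens[:2] + tokens[3:5] + [tokens[2]] + tokens[5:]
```

Each word in reordered still carries its own element, so "lives" keeps its B
tag after the move even though it is no longer next to "My sister".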

It's not perfect, but it's pretty good. We've been able to handle HTML, XML,
OOXML, and many other formats with some magic around this method.

-- Tino Didriksen