Dear all,
two of my Google Code-In students (Darkgaia, Andrei) have found another
source of HTML deformatting/reformatting errors and this seems to be a
generic one.
The file in
https://svn.code.sf.net/p/apertium/svn/trunk/apertium-en-es/dev/formatting-errors/file2.html
is valid, but its translation is not. This is because an attribute
value contains ">" and the deformatter thinks that it is the end of the
tag.
Darkgaia explains it below (after the signature).
Just to be added to the bug list if there is one.
Cheers
Mikel
------------------------------------------------------------------------
Dear Apertium Developers!
My Google Code-in name is Darkgaia, and my real name is Lee Yang Peng.
I am writing to report a major oversight in the Apertium HTML
deformatter, which causes HTML translation errors in every language
pair.
Please look at this HTML example and check that it is valid:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<img src="atob" alt="a>b"/>
</body>
</html>
Now, after translating this with Apertium using any language pair, the
result is invalid HTML.
Here is the problem:
The deformatter assumes that when a ">" appears, it closes the tag
opened by the previous "<". This is plain wrong, as ">" is a perfectly
legal character in attribute values. As a result, the tag is cut short
and the rest of it goes through Apertium as if it were text to
translate. This invariably leads to errors in the surrounding markup,
and can be fatal in something like a meta tag.
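To illustrate the failure, here is a minimal Python sketch. It is not
the actual deformatter code; naive_tag_end is a hypothetical stand-in
for any scanner that ends a tag at the first ">" it meets.

```python
# Hypothetical illustration only; this is not the Apertium deformatter.
def naive_tag_end(html: str, start: int) -> int:
    """Return the index of the first '>' at or after start, or -1."""
    return html.find(">", start)


snippet = '<img src="atob" alt="a>b"/>'
end = naive_tag_end(snippet, 0)

# What the naive scan takes to be the complete tag.
tag = snippet[: end + 1]
# What is left over and would then be treated as translatable text.
leftover = snippet[end + 1 :]

print(repr(tag))
print(repr(leftover))
```

The scan stops at the ">" inside the alt value, so the tail of the tag
is left outside it and is handled as ordinary text.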
My suggestion to fix this problem:
Instead of the deformatter closing the tag immediately when it sees a
">", try adding another condition: if the ">" is seen after an opening
" symbol, ignore it until you see the closing " symbol. That is, if a
">" appears inside an attribute value such as "hello>bye", that ">" is
ignored because it is between quotation marks. Otherwise, the
deformatter can proceed as normal.
So, add a new rule that ">" characters should be ignored if they appear
between "" marks.
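The rule could be sketched in Python roughly as follows. This is only
an illustration under my own assumptions, not a patch for the real
deformatter: find_tag_end is a hypothetical name, and the handling of
single-quoted values is an addition beyond the rule stated above.

```python
# Hypothetical sketch of a quote-aware scan for the end of a tag.
def find_tag_end(html: str, start: int) -> int:
    """Return the index of the '>' that closes the tag beginning at
    start, skipping any '>' inside a quoted attribute value.
    Return -1 if no closing '>' is found."""
    quote = None  # the quote character we are currently inside, if any
    for i in range(start, len(html)):
        ch = html[i]
        if quote is not None:
            # Inside a quoted value: only the matching quote ends it.
            if ch == quote:
                quote = None
        elif ch in ('"', "'"):
            quote = ch
        elif ch == ">":
            return i
    return -1


snippet = '<img src="atob" alt="a>b"/>'
end = find_tag_end(snippet, 0)
print(repr(snippet[: end + 1]))
```

With this scan the whole img element is kept together as one tag. An
unterminated quote makes the function return -1, so a real
implementation would need to decide how to recover from malformed
input.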
That is all :)
Darkgaia
--
Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03071 Alacant, Spain
Phone: +34 96 590 9776
Fax: +34 96 590 9326
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff