Dear all,

two of my Google Code-In students (Darkgaia, Andrei) have found another source of HTML deformatting/reformatting errors and this seems to be a generic one.

The file in

https://svn.code.sf.net/p/apertium/svn/trunk/apertium-en-es/dev/formatting-errors/file2.html

is valid, but the translation is not. This is because an attribute value contains ">" and the deformatter thinks that it is the end of the tag.

Darkgaia explains it below (after the signature).

Just to be added to the bug list if there is one.

Cheers

Mikel

------------------------------------------------------------------------
Dear Apertium Developers!

My Google Code-in name is Darkgaia, and my real name is Lee Yang Peng.

I am writing to report a major oversight in the Apertium HTML deformatter, which results in HTML translation errors in every language pair.

Please look at this HTML example and check its validity:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<img src="atob" alt="a>b"/>
</body>
</html>

Now, after translating this with Apertium using any language pair, the result is invalid.

Here is the problem:

The deformatter assumes that when a ">" appears, it closes the tag opened by the previous "<". This is plain wrong, as ">" is a perfectly legal character in attribute values. As a result, the tag is cut short and the rest of it goes through Apertium as if it were text. This invariably leads to errors in the surrounding text, because parts of the markup get modified, and it can be fatal in something like a meta tag.

My suggestion to fix this problem:

Instead of the deformatter closing the tag immediately when it sees a ">", try adding another condition: if the ">" is seen after a " symbol, ignore it until you see the second " symbol. In other words, if a > is seen inside an attribute value such as "hello>bye", that > will be ignored because it is between quotation marks. Otherwise, the deformatter can proceed as normal.

So, add a new rule that a > character should be ignored if it is seen between "".
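
To make the rule concrete, here is a minimal sketch in Python. It is for illustration only and is not the deformatter's actual code; the function names naive_tag_end and quote_aware_tag_end are hypothetical. It also treats single-quoted attribute values the same way as double-quoted ones, which is an assumption that goes slightly beyond the rule above, and it does not handle comments or CDATA sections.

```python
def naive_tag_end(text, start):
    """Index of the first '>' at or after start: the behaviour described above."""
    return text.find(">", start)


def quote_aware_tag_end(text, start):
    """Index of the first '>' at or after start that is not inside a quoted
    attribute value, or -1 if there is none."""
    quote = None  # the quote character currently open, if any
    for i in range(start, len(text)):
        ch = text[i]
        if quote is not None:
            # Inside an attribute value: only the matching quote closes it.
            if ch == quote:
                quote = None
        elif ch in ('"', "'"):  # assumption: single quotes are handled too
            quote = ch
        elif ch == ">":
            return i
    return -1


tag = '<img src="atob" alt="a>b"/>'
print(tag[: naive_tag_end(tag, 0) + 1])        # <img src="atob" alt="a>
print(tag[: quote_aware_tag_end(tag, 0) + 1])  # <img src="atob" alt="a>b"/>
```

With the naive scan, the tag is cut off inside the alt value, so the remainder (b"/>) would be treated as text. With the quote-aware scan, the whole tag is kept together.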

That is all :)

Darkgaia



--
Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03071 Alacant, Spain
Phone: +34 96 590 9776
Fax: +34 96 590 9326
