https://bugs.freedesktop.org/show_bug.cgi?id=76021
--- Comment #11 from Patrick Goetz <[email protected]> ---

> If you want a valid XML document export it as XHTML, which is actually using
> XML as a base.

The problem with this is that the xhtml I get when I use "Export to xhtml" is, in my opinion, quite bizarre (however, similar to what you get with "Publish to the Web" using Google Docs). Using the attached .docx file as a starting point, this is what I get when I export to xhtml (snippet of file):

<p class="P1"><span class="T1">Complainant</span><span class="apple-converted-space"><span class="T2"> </span></span><span class="T2">shall mean (a)</span><span class="apple-converted-space"><span class="T2"> </span></span><span class="T3">the</span><span class="apple-converted-space"><span class="T2"> </span></span><span class="T4">any</span><span class="apple-converted-space"><span class="T2"> </span></span><span class="T2">person or persons from whom the Intake Officer receives information concerning an Offense</span><span class="apple-converted-space"><span class="T2"> </span></span><span class="T4">and who, upon consent of that person(s), is designated a Complainant by the Intake Officer</span><span class="apple-converted-space"><span class="T2"> </span></span><span class="T2">or (b) any Injured Person designated by the Bishop Diocesan who in the Bishop Diocesan’s discretion, should be afforded the status of a Complainant, provided, however, that any Injured Person so designated may decline such designation.</span></p>

Ignoring the fact that vim on the Windows XP machine I'm using is not reading the UTF-8 characters correctly, notice that formatting normally expressed with common tags such as <b> and <i> is instead being emitted as classes on <span> tags. In this case, .T1 maps to a single CSS declaration:

.T1 { font-weight:bold; }

In a longer version of the same document (i.e. including more text from the same original document) you get more complex classes:

.T1 { font-size:10pt; font-weight:bold; }
.T13 { font-style:italic; }
.T14 { font-style:italic; }
.T15 { font-style:italic; }
.T16 { font-style:italic; text-decoration:underline; }
.T17 { font-style:italic; text-decoration:underline; }
.T18 { font-style:italic; }
.T19 { font-style:italic; font-weight:bold; }
.T20 { font-style:italic; font-weight:bold; }
.T21 { font-style:italic; font-weight:bold; }
.T22 { font-style:italic; font-weight:bold; }
.T26 { padding:0in; border-style:none; }
.T27 { text-decoration:underline; }
.T28 { text-decoration:underline; padding:0in; border-style:none; }
.T29 { font-style:italic; text-decoration:underline; }

This is both unreadable and hard to parse. Moreover, if I take exactly the same document and add some text, then all these classes change! Also note the strange duplication of classes that do exactly the same thing (.T13, .T14, .T15, .T18).

In my application, what I need to do is extract the text, preserving simple formatting such as <p>, <b>, <i>, and (deprecated) <strike>, in order to paste this content into another XML document. This is doable using the exported xhtml, but extremely onerous. For example, it will require at least 2 passes through a parser: first to add the simple xhtml tags I want (<b>, <i>) that weren't included in the first place, then another pass to strip out all the remaining classes and other xhtml coding that I don't want.

I can't fathom why KISS isn't being applied here: use basic xhtml tags whenever possible in order to keep the output readable and sane. I've written a fair amount of XML parsing code myself, so I do know something about it. I can't help but think this is an example of incredibly lazy programming (unless I'm missing something).
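
To make the post-processing described above concrete, here is a minimal Python sketch of one way it might be done. It is an illustration only, not anything LibreOffice provides. It assumes the class rules have the simple `.T1 { prop:value; }` shape shown above, that strikethrough is exported as `text-decoration:line-through`, and that every <span> can simply be unwrapped. The names DECL_TO_TAG, parse_css_classes and simplify are hypothetical. Because the class names change from export to export, the sketch keys on the CSS declarations rather than on the names.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical mapping from CSS declarations to the simple tags wanted.
# Anything not listed here (font-size, underline, padding, ...) is dropped.
DECL_TO_TAG = {
    ("font-weight", "bold"): "b",
    ("font-style", "italic"): "i",
    ("text-decoration", "line-through"): "strike",  # assumed export form
}


def parse_css_classes(css):
    """Return {class_name: [tag, ...]} for rules like '.T1 { prop:value; }'."""
    classes = {}
    for name, body in re.findall(r"\.([\w-]+)\s*\{([^}]*)\}", css):
        tags = []
        for decl in body.split(";"):
            if ":" not in decl:
                continue
            prop, value = (part.strip() for part in decl.split(":", 1))
            tag = DECL_TO_TAG.get((prop, value))
            if tag:
                tags.append(tag)
        classes[name] = tags
    return classes


def simplify(elem, classes):
    """Serialize elem, replacing class-styled <span>s with plain tags."""
    inner = escape(elem.text or "")
    for child in elem:
        inner += simplify(child, classes) + escape(child.tail or "")
    tag = elem.tag.rsplit("}", 1)[-1]  # strip an XML namespace, if present
    if tag == "span":
        # Unwrap the span, then re-wrap its content in the mapped tags.
        for simple in reversed(classes.get(elem.get("class", ""), [])):
            inner = f"<{simple}>{inner}</{simple}>"
        return inner
    # Every other element is kept, but its attributes are discarded.
    return f"<{tag}>{inner}</{tag}>"


css = ".T1 { font-weight:bold; } .T19 { font-style:italic; font-weight:bold; }"
xhtml = (
    '<p class="P1"><span class="T1">Complainant</span>'
    '<span class="apple-converted-space"><span class="T2"> </span></span>'
    '<span class="T2">shall mean (a)</span></p>'
)
print(simplify(ET.fromstring(xhtml), parse_css_classes(css)))
# prints: <p><b>Complainant</b> shall mean (a)</p>
```

Resolving the classes up front means the span rewriting and the stripping of leftover classes can happen in a single walk over the tree, but the stylesheet still has to be read separately first, so this does not remove the extra work, it only contains it.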
