Re: OOXML

Peter Kelly Mon, 04 Aug 2014 01:28:26 -0700

On 4 Aug 2014, at 12:16 am, jan i <[email protected]> wrote:

> By painful experience, I found out that our internal (memory) structure is
> a superset of mixed ODF and pre-ODF items. I don't think you can have a pure
> ODF/OOXML memory structure, you need internal pointers as well (like
> start/finish of copy buffer)...but of course those 2 parts should have been
> well separated.


It's possible in theory, though I'm not familiar enough with the OO codebase to 
say whether it would work in practice.

The key idea is to maintain two separate data structures - one which is the ODF 
XML trees, and another which is the internal representation. Any time a change 
gets made to the former, the implementation must update the latter to reflect 
the change. Modification operations on the latter would need to go in the other 
direction.
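
As a rough illustration of the two-structure idea, here is a minimal sketch in JavaScript. All names (SourceTree, InternalModel, derive) are hypothetical and not taken from OpenOffice, WebKit, or WebODF; it assumes the authoritative tree is just a flat list of nodes and only shows the source-to-internal direction.

```javascript
// Authoritative structure: stands in for the parsed ODF XML tree.
// Assumption: a flat list of nodes is enough to show the idea.
class SourceTree {
  constructor() {
    this.nodes = [];
    this.listeners = []; // callbacks run after every change
  }
  onChange(listener) {
    this.listeners.push(listener);
  }
  insert(index, node) {
    this.nodes.splice(index, 0, node);
    this.notify({ type: "insert", index, node });
  }
  remove(index) {
    const [node] = this.nodes.splice(index, 1);
    this.notify({ type: "remove", index, node });
  }
  notify(change) {
    for (const listener of this.listeners) listener(change);
  }
}

// Internal representation: derived data that is kept in step with the source.
class InternalModel {
  constructor(source) {
    this.items = source.nodes.map((node) => InternalModel.derive(node));
    source.onChange((change) => this.apply(change));
  }
  static derive(node) {
    // Hypothetical derived data; a real editor would compute layout etc.
    return { text: node.text, length: node.text.length };
  }
  apply(change) {
    if (change.type === "insert") {
      this.items.splice(change.index, 0, InternalModel.derive(change.node));
    } else if (change.type === "remove") {
      this.items.splice(change.index, 1);
    }
  }
}

const source = new SourceTree();
const model = new InternalModel(source);
source.insert(0, { tag: "text:p", text: "Hello" });
source.insert(1, { tag: "text:p", text: "World!" });
source.remove(0);
console.log(model.items); // one item left, derived from "World!"
```

Edits that originate in the internal model would need the mirror image of this: a write path that changes the source tree and lets the resulting notification bring the model back in step.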

This is how WebKit works (well, at least how it worked last time I touched the 
code, which was more than 10 years ago...). There is the DOM tree and the 
rendering tree. The DOM tree stores the HTML structure exactly as it was parsed 
from the original file; this is accessible to JavaScript code and can be 
modified in arbitrary ways. Whenever the DOM tree changes, WebKit updates its 
rendering tree, based both on the DOM tree and applicable rules from the CSS 
stylesheet. The rendering tree is the internal model which is used for 
displaying the content on screen.

Importantly, the DOM tree is also allowed to contain arbitrary XML elements in 
any namespace. This is how WebODF works; it includes the content.xml from the 
package directly, and that's the "authoritative" data structure that is 
manipulated during editing. The CSS rules WebODF uses control rendering of the 
content.

> I wonder, you wrote earlier that UXwrite uses html internally, that seems
> to me like the lowest common denominator...I would have thought a real
> superset would have been the better choice?

Well a convenient thing about HTML is that you can include your extensions 
without affecting the rendered output, or risking loss of the data. This 
includes custom elements, custom attributes, and CSS style names that you may 
choose to assign special meaning to.

The reasons for this are largely due to the way in which HTML has historically 
evolved... browsers deliberately allow the presence of "invalid" elements they 
don't know about, to cater for future versions of the spec which add new 
elements. The idea is "graceful degradation", such that if you try to view a 
site that uses some new HTML features your browser doesn't support, it should 
at least in theory still let you see most of the content, just that you won't 
be able to use the new features. Depending on the HTML/CSS design, this works 
better in practice on some sites than on others. Then of course there are 
JavaScript APIs which can cause compatibility issues, though that's a separate 
topic, and the browser will usually at least display the content even if it 
can't do dynamic stuff because the JS code threw an exception.

In the case of UX Write, there are a few instances where I've used custom 
extensions to handle certain things. The main ones are:

1. Table of contents/list of tables/list of figures.

When you insert one of these into your document, it inserts a <nav> element 
with a CSS class name of "tableofcontents", "listoffigures", or "listoftables", 
which were chosen as these are the same keywords that LaTeX uses for these 
features. UX Write treats these as having special meaning, in the sense that 
when opening a document (and when the document is modified), it updates the 
content of these <nav> elements based on the set of all heading, figure, or 
table elements in the document (including numbering/captions).
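
To make the regeneration step concrete, here is a minimal sketch of building numbered table-of-contents entries from a list of headings. It is not UX Write's actual code: the function name buildTocEntries, the shape of the heading objects, and the dotted numbering scheme are all assumptions, and it works on plain objects rather than DOM elements so it can run on its own.

```javascript
// Build numbered TOC entries from headings listed in document order.
// Each heading is assumed to look like { level, text, id }.
function buildTocEntries(headings) {
  const counters = []; // counters[i] is the current number at level i + 1
  return headings.map(({ level, text, id }) => {
    while (counters.length < level) counters.push(0); // open deeper levels
    counters.length = level;                          // close deeper levels
    counters[level - 1] += 1;
    return { number: counters.join("."), text, href: "#" + id };
  });
}

const entries = buildTocEntries([
  { level: 1, text: "Introduction", id: "h1" },
  { level: 2, text: "Scope", id: "h2" },
  { level: 1, text: "Design", id: "h3" },
]);
console.log(entries.map((e) => e.number + " " + e.text));
// [ '1 Introduction', '1.1 Scope', '2 Design' ]
```

An editor would rerun something like this on open and after any change to a heading, then replace the children of the matching <nav> element with links built from the entries.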

2. OOXML-specific features.

When converting from .docx to .html during the process of opening a document, 
it assigns certain pre-defined CSS class names to particular types of HTML 
elements to indicate their purpose. For example, a cross-reference whose 
display format is supposed to include both the label and caption of a figure 
will be translated as:

<a href="#idN" class="uxwrite-ref-label-num">...</a>

where N is the id of the target. The editing code knows about these class names 
and uses them to update the text inside the <a> element if the figure number or 
caption changes. Similarly, where there is an unsupported object, like an 
embedded spreadsheet, it will translate this as:

<span class="uxwrite-placeholder">[Unsupported object]</span>.

During editing, WebKit preserves these, since they're just CSS class names and 
don't in any way cause problems with the HTML or rendering. All of the core 
editing operations are implemented in JavaScript, and these take the class 
names into account where appropriate.

3. Element mappings for bidirectional transformation.

For every HTML element that is generated from an OOXML element, it sets the id 
attribute to a string of the form bdt(N)-(M), where N is a randomly-generated 
number for each editing session, and M is the sequence number of the element in 
the OOXML tree. The purpose of the randomly-generated N value is to ensure that 
there aren't mixups for BDT updates if that HTML content gets copied & pasted 
into another document within UX Write itself. The number used for the M value 
is the position of the element in a pre-order traversal of the XML tree of 
document.xml. In cases where the element corresponds to an XML file in the 
package that is *not* the main content (currently only for the case of 
footnotes and endnotes), it is prefixed with a string identifying the file, so 
it can be properly identified.

When a document is saved, and the BDT update process takes place, it uses these 
to re-establish the relationship between elements in the HTML file and elements 
in the OOXML content tree, and figure out where changes have taken place. Given 
this mapping, it is able to update the OOXML file based on content from the 
HTML file.
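
The id scheme can be sketched roughly as follows. This is an illustration, not the real implementation: it assumes the ids look like "bdt" + N + "-" + M with plain decimal numbers, that sequence numbers start at 0, and that the tree is made of plain objects with a children array. The function names are hypothetical, and the file prefix used for footnotes and endnotes is left out.

```javascript
// Assign ids in pre-order and remember which source node each id refers to.
function assignIds(root, sessionId) {
  const mapping = new Map(); // id string -> source node
  let seq = 0;
  function visit(node) {
    mapping.set(`bdt${sessionId}-${seq}`, node);
    seq += 1;
    for (const child of node.children || []) visit(child);
  }
  visit(root);
  return mapping;
}

// Return the sequence number, or null if the id is not in this format or
// belongs to a different session (for example, content pasted from another
// document).
function parseId(id, sessionId) {
  const match = /^bdt(\d+)-(\d+)$/.exec(id);
  if (match === null || Number(match[1]) !== sessionId) return null;
  return Number(match[2]);
}

const tree = {
  name: "w:document",
  children: [
    { name: "w:body", children: [{ name: "w:p" }, { name: "w:p" }] },
  ],
};
const mapping = assignIds(tree, 42);
console.log([...mapping.keys()]);
// [ 'bdt42-0', 'bdt42-1', 'bdt42-2', 'bdt42-3' ]
console.log(parseId("bdt42-3", 42)); // 3
console.log(parseId("bdt7-3", 42));  // null
```

On save, HTML elements whose id parses to a number can be matched back to their source nodes through the mapping, while elements with no id or a foreign session number would be treated as new content.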

This is all fully conformant with the HTML spec, which allows you to choose 
whatever values you want for id attributes. And the editor neither knows nor 
cares whether the file it's working with was stored as .html or .docx; what 
happens on save is entirely separate from what happens during editing. In the 
case of HTML, the file is just saved directly, and in the case of .docx, the 
BDT process described above occurs. I'll be using exactly this same approach 
for supporting .odt files.

4. Extra elements to indicate selection.

The iOS version of WebKit has a broken selection API (or at least did at the 
time I began writing the app, which was in the days of iOS 5), so I had to 
"fake" selections by creating my own <div> and <span> elements with the 
light-blue background colour. These are just regular HTML elements with CSS 
styling - nothing special about them. The editor keeps track of which elements 
in the document are used for faking selections, and these are removed before 
save; it's a runtime thing only.

In addition to all of the above, the JavaScript code maintains further data 
structures for information that isn't possible to represent (or doesn't make 
sense to represent) in the HTML structure itself. These include a list of 
undo/redo operations, event listeners for changes to elements that would 
affect the table of contents/cross-references, an abstract tree representing 
the document outline, and so forth. These are all JavaScript objects, but they 
are separate from the DOM tree and have no effect on opening or saving a file. 
The HTML DOM remains the core data structure used, and WebKit preserves all 
the information needed.

> Some parts of AOO use the structure directly, others go through the API,
> that is not very clean, and makes it extremely difficult to test changes in
> the internal memory layout. An application like this (and many other
> similar types), should see the memory as a capsule, with a fixed API around
> it.

Agreed; I think it's important to maintain a separation between the internal 
data structures used by the editor and other code (file format loading/saving, 
automated tests, and plugins), so that the internal structures can be changed 
without affecting any of these.

--
Dr. Peter M. Kelly
Founder, UX Productivity
[email protected]
http://www.uxproductivity.com/
http://www.kellypmk.net/

PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)
