> On 17 Jun 2015, at 8:09 pm, Ian C <i...@amham.net> wrote:
> 
> Hi Peter,
> 
> when the Word converter creates an html element via the
> WordConverterCreateAbstract function it creates an associated id attribute.
> 
> Having examined the resulting html I see each element does have an id.
> 
> Are these necessary and if so when and where? I'm guessing some sort of
> lookup function somewhere?

The id attributes are used for two purposes:

1. To enable elements in an updated version of the document to be correlated 
with the elements from the original version
2. As a target for cross-references to figures, tables and headings.

The first one is the most important, since it applies to all elements, instead 
of only those that are targets of cross-references.

The number included in the id attribute is the “sequence number” of the node in 
the document (the seqNo field of DFNode). During parsing, these are assigned 
sequentially, starting from 0; as a result, the sequence numbers in a document 
immediately after parsing are in the same order as the nodes appear in the 
originating XML file.

This ordering does not really matter as such, but the consistency does - two 
parses of the same XML file are guaranteed to produce the same sequence 
numbers. The update process (HTML -> docx) relies on this guarantee, since it 
re-parses the docx file from which the HTML was generated, and assumes that the 
ids in the HTML match up with the sequence numbers obtained from the parse.

When new nodes are added to a document after parsing, they are assigned new 
sequence numbers consecutively, starting with the first number after what has 
been assigned so far.
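
To make the numbering rule concrete, here is a minimal sketch in C. It does 
not use the real DFNode or DFDocument definitions; SketchNode, SketchDoc and 
the two functions are hypothetical stand-ins that only model a counter which 
starts at 0 and hands out the next unused number.

```c
#include <stddef.h>

/* Hypothetical stand-ins, NOT the real DFNode/DFDocument structures. */
typedef struct SketchNode {
    unsigned int seqNo;      /* plays the role of the seqNo field */
} SketchNode;

typedef struct SketchDoc {
    unsigned int nextSeqNo;  /* first sequence number not yet handed out */
} SketchDoc;

/* A freshly parsed document starts numbering at 0. */
void sketchDocInit(SketchDoc *doc)
{
    doc->nextSeqNo = 0;
}

/* Give a node the next unused sequence number. Calling this for each node
   in the order the parser visits them yields numbers in that same order;
   calling it later for a newly added node continues where parsing
   stopped. */
void sketchAssignSeqNo(SketchDoc *doc, SketchNode *node)
{
    node->seqNo = doc->nextSeqNo++;
}
```

Because the counter depends only on the order in which nodes are visited, 
running this twice over the same input gives the same numbers both times, 
which is the property the update process depends on.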

DFDocument maintains a mapping from id attributes to Nodes. So if you have a 
node in the document.xml file, say, and you want to find the corresponding HTML 
element (if it exists), then you construct a string with the id prefix and the 
sequence number, and then do a lookup in the nodesByIdAttr hash table of the 
DFDocument object. There is a convenience function that does this, called 
DFElementForIdAttr(). This function is used in WordBookmarks and WordFields for 
dealing with cross-references.
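
As a rough illustration of that lookup, the sketch below builds the id string 
from a prefix and a sequence number and then searches for a matching element. 
Everything in it is an assumption made for the example: the "word" prefix used 
in the usage note, the SketchElement type, and the linear scan, which merely 
stands in for the nodesByIdAttr hash table. It is not the code behind 
DFElementForIdAttr().

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an HTML element carrying an id attribute. */
typedef struct SketchElement {
    const char *idAttr;   /* value of the id attribute */
    const char *tagName;  /* element name, for illustration only */
} SketchElement;

/* Write prefix followed by the decimal sequence number into buf.
   Returns 0 on success, -1 if the result did not fit in buf. */
int sketchFormatIdAttr(char *buf, size_t bufSize,
                       const char *prefix, unsigned int seqNo)
{
    int n = snprintf(buf, bufSize, "%s%u", prefix, seqNo);
    return (n >= 0 && (size_t)n < bufSize) ? 0 : -1;
}

/* Linear scan standing in for the hash table lookup. Returns the element
   whose id attribute equals idAttr, or NULL if there is none. */
const SketchElement *sketchElementForIdAttr(const SketchElement *elements,
                                            size_t count,
                                            const char *idAttr)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(elements[i].idAttr, idAttr) == 0)
            return &elements[i];
    }
    return NULL;
}
```

With a prefix of "word" and a sequence number of 12, sketchFormatIdAttr() 
produces the string "word12", which is then the key passed to the lookup.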

WordConverterCreateAbstract() is used for creating an HTML element in the ‘get’ 
operation. It sets the id attribute based on the prefix used during conversion, 
and the sequence number of the supplied concrete element. This sets up the 
relationship, which is subsequently used in the ‘put’ operation.

WordConverterGetConcrete() does the reverse. It takes as input an HTML element 
from the abstract document, and checks to see if it has an id attribute. If so, 
it extracts the sequence number from the attribute, and uses that to locate the 
concrete element (typically in document.xml) from which that HTML element was 
originally derived. 

Once it has determined the sequence number, WordConverterGetConcrete() calls 
DFNodeForSeqNo(), which uses a hash table maintained by the document to map 
sequence numbers to nodes. The result may be NULL, indicating that there is no 
such node in the document, though in general that’s unlikely.
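
The extraction step could look something like the following. This is only a 
sketch under assumptions: sketchSeqNoFromIdAttr() is a hypothetical helper, 
and it assumes the id is exactly the prefix followed by decimal digits, which 
may be stricter than what WordConverterGetConcrete() actually accepts.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: recover the sequence number from an id attribute.
   Returns 1 and stores the number in *seqNo if idAttr is the prefix
   followed only by decimal digits; returns 0 otherwise and leaves *seqNo
   untouched. */
int sketchSeqNoFromIdAttr(const char *idAttr, const char *prefix,
                          unsigned int *seqNo)
{
    size_t prefixLen = strlen(prefix);
    if (idAttr == NULL || strncmp(idAttr, prefix, prefixLen) != 0)
        return 0;

    const char *digits = idAttr + prefixLen;
    if (!isdigit((unsigned char)digits[0]))
        return 0;  /* rejects an empty suffix, a sign, or whitespace */

    errno = 0;
    char *end = NULL;
    unsigned long value = strtoul(digits, &end, 10);
    if (errno != 0 || *end != '\0' || value > UINT_MAX)
        return 0;  /* overflow, trailing characters, or out of range */

    *seqNo = (unsigned int)value;
    return 1;
}
```

A caller would pass the resulting number to the sequence number lookup and 
still has to handle a NULL result, for example when the HTML contains an id 
that was typed by hand or copied from another document.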

The most important use of WordConverterGetConcrete() is in WordContainerPut(), 
which is a wrapper around BDTContainerPut. The BDTContainerPut function is what 
handles the re-ordering of nodes (e.g. if a paragraph was moved to a different 
part of the HTML document, we move its counterpart in document.xml, retaining 
all supported and unsupported properties, e.g. certain formatting options that 
can’t be expressed in HTML).

Hope this clears things up a little bit… let me know if you need me to clarify 
anything further.

And yes, I believe we’ll need the same thing for ODF, in order to properly 
handle bidirectional transformation, which allows us to preserve aspects of the 
ODF document that we don’t yet (or can’t) express in HTML. Perhaps this can be 
abstracted in a generic manner so that it can be used by both filters (and 
others in the future).

—
Dr Peter M. Kelly
pmke...@apache.org

PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)
