RE: merge mode for XML

Peter Ring Sun, 28 Apr 2002 23:11:09 -0700

One of the most widespread uses of XML is as a neutral storage
and exchange format for documents. In these cases, avoiding XML
or SGML would just imply going back to Word or FrameMaker (and we
don't want that), or to LaTeX or Texinfo, which are similar wrt.
merging. Or HTML, an application of SGML. And anyway, a lot of
documentation for open source projects is being written in or
converted to DocBook, and will be maintained using the same
revision control tools as the rest of the project, i.e., cvs.
So we are going to see questions about XML or SGML pop up more
frequently.


Many of the issues wrt. cvs are essentially not much different
from maintaining documentation written in LaTeX or Texinfo format.
Except you can't assume that authors will be using vi or emacs.
A lot of different tools will be used -- that was one of the
main points of using SGML or XML.

It is common sense to break up a document into mini- or
micro-documents, each with its own lifecycle -- just as you do
for programming source code. The concept of storage management
is built into SGML and XML at a very low level. The customary
way to do this is by declaring entities, symbolic names for
storage objects, which can then be included in other documents
at appropriate places. XInclude and XLink (and for SGML, HyTime)
also offer ways to include or locate parts of documents in terms
of parse trees.
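
As a rough illustration of assembling a document from separately
maintained parts, here is a small Python sketch using the standard
library's XInclude support (xml.etree.ElementInclude). The file name
intro.xml, the PARTS dictionary and the loader function are made up
for the example; a real setup would read the parts from the checked-out
working copy instead.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical "files": each part would normally live in its own
# file under revision control.
PARTS = {"intro.xml": "<para>Papageno</para>"}

def loader(href, parse, encoding=None):
    """Resolve an xi:include href against the in-memory PARTS table."""
    if parse == "xml":
        return ET.fromstring(PARTS[href])
    return PARTS[href]  # parse="text": return the raw text

master = ET.fromstring(
    '<section xmlns:xi="http://www.w3.org/2001/XInclude">'
    "<title>Beautifying XML</title>"
    '<xi:include href="intro.xml"/>'
    "</section>"
)

# Replace every xi:include element in place with the part it names.
ElementInclude.include(master, loader=loader)
print(ET.tostring(master, encoding="unicode"))
```

The master file stays small and stable, so most commits touch only one
part, which is what keeps merges manageable.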

But how about the physical storage format of each file? Authors
will often be using different XML or SGML editors that will
'beautify' the XML or SGML source in different ways, introducing
spurious differences and conflicts. Other sources of spurious
conflicts are character encoding, namespace declarations, and
order of attributes; most documents can be stored in a number of
different ways with no loss of information for the intended use.
But a simple line-based diff will report a lot of differences
that are, essentially, not there.

Until proper XML repositories become as ubiquitous as cvs, we
might as well find a way to live with it.

The character encoding is easy to control -- SGML and XML are
very explicit about it, and editors do in general handle encoding
gracefully.
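
To make that concrete, here is a small Python sketch (not part of any
cvs tooling) that re-serializes a document in one agreed encoding, so
that two authors saving in different encodings end up committing the
same bytes. The helper name to_utf8 and the choice of UTF-8 are just
assumptions for the example. Note that ElementTree drops the DOCTYPE
declaration and comments outside the root element, so this is only a
demonstration of the principle.

```python
import xml.etree.ElementTree as ET

def to_utf8(xml_bytes):
    """Parse using the declared encoding, then write the tree as UTF-8."""
    root = ET.fromstring(xml_bytes)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)

body = "<para>S\u00f8ren</para>"
latin1 = ('<?xml version="1.0" encoding="iso-8859-1"?>' + body).encode("iso-8859-1")
utf8 = ('<?xml version="1.0" encoding="utf-8"?>' + body).encode("utf-8")

# The stored bytes differ, but the normalized forms are identical.
print(latin1 != utf8, to_utf8(latin1) == to_utf8(utf8))
```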

Namespace declarations and attribute order are tricky. Things
can be normalized, see Canonical XML, http://www.w3.org/TR/xml-c14n,
but full canonicalization of a document will often be too much.
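
A minimal sketch of what canonicalization buys you, using the
canonicalize() function from Python's xml.etree.ElementTree (Python 3.8
or later; it implements C14N 2.0, a later version of the specification
than the one linked above). The helper same_document is a name invented
for the example.

```python
from xml.etree.ElementTree import canonicalize

def same_document(a, b):
    """True if two serializations are equal after canonicalization."""
    return canonicalize(a) == canonicalize(b)

# Same element, but different attribute order, quoting style and
# whitespace inside the start tag.
one = '<emphasis role="bold" id="e1">some text</emphasis>'
two = "<emphasis id='e1'\n   role='bold' >some text</emphasis>"

print(one == two)               # plain string comparison
print(same_document(one, two))  # comparison of canonical forms
```

A line-based diff sees two different files here, while the canonical
forms are the same.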

The 'beautify' problem is even worse, i.e., how to introduce and
remove whitespace in a way that makes cvs behave meaningfully.
I have not yet found a simple recipe for beautifying SGML and XML.
Here are some of the options:

The most generic, simple and safe way to break XML or SGML into
lines is unfortunately not too pretty. Keep any line breaks
already present in the source and, in addition, break just _before_
the markup delimiter close character '>' on the start tag, e.g.:

$ osx xml.dcl beautify.xml
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE section PUBLIC
"-//OASIS//DTD DocBook XML V4.2//EN"
"docbookx.dtd">
<section
><title
>Beautifying XML</title><para
>Papageno</para><para
>Break inside markup like
this: <emphasis
role="bold"
>some text</emphasis>.</para><para
>Papagena</para></section>
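
The core of that rule can be sketched in a few lines of Python. This is
not what osx does internally, just a regular-expression approximation;
the function name is invented, and it assumes that attribute values
contain no literal '>' and that the document has no comments or CDATA
sections containing tag-like text. Unlike the osx output above, it does
not put each attribute on its own line.

```python
import re

# A start tag or empty-element tag: '<', a name start character, then
# anything up to the closing '>'. End tags, comments, processing
# instructions and declarations do not match.
START_TAG = re.compile(r"<([A-Za-z_][^<>]*?)(/?)>")

def break_before_tag_close(xml_text):
    """Insert a line break just before the '>' (or '/>') of start tags."""
    return START_TAG.sub(
        lambda m: "<" + m.group(1) + "\n" + m.group(2) + ">", xml_text
    )

source = "<section><title>Beautifying XML</title><para>Papageno</para></section>"
print(break_before_tag_close(source))
```

Because the added line breaks sit inside the tags, they never become
part of the character data, which is why this scheme is safe even
where whitespace is significant.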

Some tools can beautify in a way more suitable for human consumption:

$ xmllint --format beautify.xml
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE section PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
"docbookx.dtd">
<section>
  <title>Beautifying XML</title>
  <para>Papageno</para>
  <para>Break inside markup like
this: <emphasis role="bold">some text</emphasis>.</para>
  <para>Papagena</para>
</section>

Keeping white space in character context while beautifying is a simple
way to avoid problems with NOTATION linespecific AKA xml:space='preserve'
AKA <pre>. But the reason we needed a beautifier in the first place is
that editors put in different amounts of whitespace in different places.
If someone out there has a nice and robust XSLT stylesheet for
normalizing/beautifying XML, please publish!
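
Not XSLT, but the idea can be sketched in Python with ElementTree:
rewrite whitespace only where an element contains nothing but child
elements, and leave anything with character data alone. The function
names are invented for the example. Without a DTD the code has to guess
what is element content, so a paragraph consisting only of inline
elements separated by spaces would be treated wrongly; xml:space is not
checked either, and ElementTree drops the DOCTYPE.

```python
import xml.etree.ElementTree as ET

def has_character_data(elem):
    """True if the element directly contains non-whitespace text."""
    if elem.text and elem.text.strip():
        return True
    return any(child.tail and child.tail.strip() for child in elem)

def indent_element_content(elem, level=0, step="  "):
    """Re-indent element-only content; leave mixed content untouched."""
    children = list(elem)
    if not children or has_character_data(elem):
        return  # character data here: do not touch this subtree
    pad = "\n" + step * (level + 1)
    elem.text = pad
    for child in children:
        indent_element_content(child, level + 1, step)
        child.tail = pad
    children[-1].tail = "\n" + step * level

root = ET.fromstring(
    "<section><title>Beautifying XML</title>"
    "<para>Break inside markup like\nthis: "
    '<emphasis role="bold">some text</emphasis>.</para></section>'
)
indent_element_content(root)
print(ET.tostring(root, encoding="unicode"))
```

Running it on its own output gives the same result again, and that
stability is what matters for cvs: whatever an editor did to the
whitespace between block-level elements, the committed form is the
same.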

There are, BTW, XML diff tools. See e.g.:

  http://www.alphaworks.ibm.com/tech/xmldiffmerge
  http://www.deltaxml.com
  http://www.vmguys.com/vmtools
  http://www.logilab.org/xmldiff

The first one can be used as a merge tool. The other ones can produce an
XML diff file that -- given a proper XML patch utility -- can update
one XML file to become the other one.

There are, to the best of my knowledge, no freely available stand-alone
SGML diff tools. Some editors, e.g. ArborText Epic, can do a very nice
compare.

kind regards,
Peter Ring


-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of
Greg A. Woods
Sent: 26. april 2002 23:45
To: CVS-II Discussion Mailing List
Subject: RE: merge mode for XML


<snip>

A better approach is to avoid XML entirely in the first place -- it's a
really really horrid syntax with all kinds of goo that's usually way
over-kill for the application, being SGML based and all that....

</snip>


_______________________________________________
Info-cvs mailing list
[EMAIL PROTECTED]
http://mail.gnu.org/mailman/listinfo/info-cvs
