As part of my work for DERBY-1758 I'm looking at the XML binding test
(lang/xmlBinding.java in the old harness, lang/XMLBindingTest.java in JUnit) and
I noticed that the test, which counts characters as a simple sanity check for
insertion of docs larger than 32k, returns different results on Linux vs
Windows. (Actually, Bryan Pendleton was the first one to notice this a while
back when he was reviewing the DERBY-688 changes.)
Long story short, Xalan serialization (which is what Derby uses to serialize XML
documents) inserts platform-specific line-endings (based on the "line.separator"
System property) into XML documents for every newline. This appears to be
technically valid, so it is not a bug per se [1]. However, from a Derby
perspective this means that someone who inserts the exact same XML document into
an XML column on Windows vs on Linux will actually be inserting more characters
in the former case than in the latter (because the Windows line separator is two
characters). Or put differently, when inserting an XML document on Windows an
extra character is written to disk for every line in the XML document. This
does *not* happen with other character types (e.g. CLOB).
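To make the arithmetic concrete, here is a minimal sketch of the effect. It does not call Xalan at all; the hypothetical helper `withSeparator` simply stands in for a serializer that writes each newline using a given line separator, and the three-line document is made up for illustration.

```java
// Hypothetical illustration only: simulates a serializer that emits the
// given separator for every newline in the document. Not Xalan code.
class LineSeparatorCount {

    // Rewrite every "\n" in doc using the supplied line separator.
    static String withSeparator(String doc, String sep) {
        return doc.replace("\n", sep);
    }

    public static void main(String[] args) {
        String doc = "<a>\n<b/>\n</a>\n";   // three lines, 14 characters

        // Unix-style separator: one character per line ending.
        System.out.println(withSeparator(doc, "\n").length());    // 14

        // Windows-style separator: one extra character per line.
        System.out.println(withSeparator(doc, "\r\n").length());  // 17
    }
}
```

The difference between the two counts equals the number of lines, which is the kind of discrepancy the test's character count is running into.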
My question, then, is this: Is it considered a "bug" in Derby if insertion of
the same XML value by the user can lead to different data (namely, line ending
characters) being written to disk for different platforms?
There appear to be two obvious ways to get around this problem: 1) add logic in
the Derby engine to take the result of Xalan serialization and replace
platform-specific line-endings with "\n", or 2) change the XML binding test to
always count line-endings as a single "character" for the sake of asserting
character counts.
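For reference, both options could be sketched roughly as follows. The class and method names here are hypothetical and are not existing Derby code. `normalize` corresponds to option 1 and assumes that any raw "\r" in the serialized text came from the line separator; `countPortable` corresponds to option 2 and counts a "\r\n" pair as one character.

```java
// Hypothetical helpers, not part of Derby; names are for illustration only.
final class LineEndings {

    // Option 1: replace platform-specific line endings with "\n".
    // CRLF pairs are handled first so each pair becomes a single "\n",
    // then any remaining lone CR is converted as well.
    static String normalize(String serialized) {
        return serialized.replace("\r\n", "\n").replace('\r', '\n');
    }

    // Option 2: count characters, treating a CRLF pair as one "character"
    // so that the count is the same on every platform.
    static int countPortable(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\r'
                    && i + 1 < text.length()
                    && text.charAt(i + 1) == '\n') {
                i++; // skip the LF half of the CRLF pair
            }
            count++;
        }
        return count;
    }
}
```

With option 1 the fix lives in the engine and the data on disk is the same on all platforms; with option 2 only the test changes and the stored data still differs.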
I'm leaning toward option 1, but am not particularly driven one way or the
other. If the answer to my above question is "Yes, it's a bug", then option 1
is clearly the only option; otherwise option 2 makes the test pass and is easy
to implement. It does feel a tad like cheating, though...
Comments/feedback are appreciated, if anyone has any.
Thanks,
Army
----
[1]
I searched Jira for this and found a couple of relevant Xalan issues, especially
XALANJ-2093 and XALANJ-1701. There is apparently a new property introduced in
Xalan 2.7 to allow the user to indicate what should happen with newlines, but
that property is non-standard and would require Derby to use Xalan 2.7 in order
to build.
Based on comments in the aforementioned XALANJ issues it looks like it is
technically valid for Xalan to convert the newlines to platform-specific
endings. This seems to agree with the following quote from the W3C page on
serialization:
http://www.w3.org/TR/xslt-xquery-serialization/#serdm:
"When outputting a newline character in the instance of the data model, the
serializer is free to represent it using any character sequence that will be
normalized to a newline character by an XML parser, unless a specific mapping
for the newline character is provided in a character map (see 9 Character Maps)."
I don't know what Xalan serialization does with character maps, but there is
nothing explicit in Derby to specify use of such maps, so my (admittedly
lacking) understanding is that it's okay for Xalan to return platform-specific
line-endings when serializing.