As part of my work for DERBY-1758 I'm looking at the XML binding test (lang/xmlBinding.java in the old harness, lang/XMLBindingTest.java in JUnit) and I noticed that the test, which counts characters as a simple sanity check for insertion of docs larger than 32k, returns different results on Linux vs Windows. (Actually, Bryan Pendleton was the first one to notice this a while back when he was reviewing the DERBY-688 changes.)

Long story short, Xalan serialization (which is what Derby uses to serialize XML documents) inserts platform-specific line-endings (based on the "line.separator" System property) into XML documents for every newline. This appears to be technically valid, so it is not a bug per se [1]. However, from a Derby perspective this means that someone who inserts the exact same XML document into an XML column on Windows vs on Linux will actually be inserting more characters in the former case than in the latter (because the Windows line separator is two characters, "\r\n", where the Linux one is just "\n"). Or put differently, when inserting an XML document on Windows an extra character is written to disk for every line in the XML document. This does *not* happen with other character types (e.g. CLOB).
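
To make the size difference concrete, here is a minimal sketch of the arithmetic. It does not call Xalan at all; withSeparator() is a hypothetical stand-in that just writes every newline using a given separator, which is my understanding of the net effect of the serializer's behavior.

```java
class LineEndingCount {
    // Hypothetical stand-in for the serializer: write each newline in
    // 'doc' using the given separator. This is NOT Xalan code, only an
    // illustration of the effect described above.
    static String withSeparator(String doc, String separator) {
        return doc.replace("\n", separator);
    }

    public static void main(String[] args) {
        String doc = "<a>\n<b/>\n</a>\n";   // three lines, 14 characters

        // Linux-style separator: length is unchanged.
        System.out.println(withSeparator(doc, "\n").length());    // 14

        // Windows-style separator: one extra character per line.
        System.out.println(withSeparator(doc, "\r\n").length());  // 17
    }
}
```

So a three-line document grows by three characters on Windows, and a document larger than 32k with many lines grows accordingly, which is why the character counts in the test diverge.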

My question, then, is this: Is it considered a "bug" in Derby if insertion of the same XML value by the user can lead to different data (namely, line ending characters) being written to disk for different platforms?

There appear to be two obvious ways to get around this problem: 1) add logic in the Derby engine to take the result of Xalan serialization and replace platform-specific line-endings with "\n", or 2) change the XML binding test to always count a line-ending as a single "character" for the sake of asserting character counts.
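
For illustration, both options could be sketched roughly as follows. The class and method names (XmlLineEndings, normalize, countChars) are made up for this mail and are not existing Derby code. The sketch also assumes that any carriage return in the serialized text came from the line separator, which I have not verified for Xalan.

```java
final class XmlLineEndings {
    // Option 1 (engine side): after serialization, collapse "\r\n" and
    // any lone "\r" to "\n" so the stored value is the same everywhere.
    // Assumption: every '\r' here was produced by the line separator.
    static String normalize(String serialized) {
        return serialized.replace("\r\n", "\n").replace('\r', '\n');
    }

    // Option 2 (test side): count characters, treating a "\r\n" pair
    // as a single character. The stored data is left untouched.
    static int countChars(String serialized) {
        int count = 0;
        for (int i = 0; i < serialized.length(); i++) {
            boolean crBeforeLf = serialized.charAt(i) == '\r'
                    && i + 1 < serialized.length()
                    && serialized.charAt(i + 1) == '\n';
            if (!crBeforeLf) {
                count++;   // the '\n' of the pair is counted on the next pass
            }
        }
        return count;
    }
}
```

With either approach, "<a>\r\n<b/>\r\n" and "<a>\n<b/>\n" end up with the same count; the difference is that option 1 also makes the bytes on disk identical, while option 2 only hides the difference from the test.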

I'm leaning toward option 1, but am not particularly driven one way or the other. If the answer to my above question is "Yes, it's a bug", then option 1 is clearly the only option; otherwise option 2 makes the test pass and is easy to implement. It does feel a tad like cheating, though...

Comments/feedback are appreciated, if anyone has any.

Thanks,
Army

----

[1]

I searched Jira for this and found a couple of relevant Xalan issues, especially XALANJ-2093 and XALANJ-1701. There is apparently a new property introduced in Xalan 2.7 to allow the user to indicate what should happen with newlines, but that property is non-standard and would require Derby to use Xalan 2.7 in order to build.

Based on comments in the aforementioned XALANJ issues it looks like it is technically valid for Xalan to convert the newlines to platform-specific endings. This seems to agree with the following quote from the W3C page on serialization:

http://www.w3.org/TR/xslt-xquery-serialization/#serdm:

"When outputting a newline character in the instance of the data model, the serializer is free to represent it using any character sequence that will be normalized to a newline character by an XML parser, unless a specific mapping for the newline character is provided in a character map (see 9 Character Maps)."

I don't know what Xalan serialization does with character maps, but there is nothing explicit in Derby to specify use of such maps, so my (admittedly lacking) understanding is that it's okay for Xalan to return platform-specific line-endings when serializing.
