Writing platform-specific line-endings to disk...

Army Fri, 17 Nov 2006 15:04:56 -0800

As part of my work for DERBY-1758 I'm looking at the XML binding test (lang/xmlBinding.java in the old harness, lang/XMLBindingTest.java in JUnit) and I noticed that the test, which counts characters as a simple sanity check for insertion of docs larger than 32k, returns different results on Linux vs Windows. (Actually, Bryan Pendleton was the first one to notice this a while back when he was reviewing DERBY-688 changes).

Long story short, Xalan serialization (which is what Derby uses to serialize XML documents) inserts platform-specific line-endings (based on the "line.separator" System property) into XML documents for every newline. This appears to be technically valid, so it is not a bug per se [1]. However, from a Derby perspective this means that someone who inserts the exact same XML document into an XML column on Windows vs on Linux will actually be inserting more characters in the former case than in the latter (because the Windows line separator is two characters). Or put differently, when inserting an XML document on Windows an extra character is written to disk for every line in the XML document. This does *not* happen with other character types (ex. CLOB).
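
To make the character-count difference concrete, here is a minimal sketch in plain Java. It does not call Xalan at all; the class name LineEndingCount and the helper withSeparator are hypothetical, and the helper only imitates the effect described above by rewriting each newline as a chosen separator.

```java
final class LineEndingCount {

    /**
     * Hypothetical stand-in for the serializer's behavior: rewrites every
     * "\n" in the document as the given line separator.
     */
    static String withSeparator(String doc, String separator) {
        return doc.replace("\n", separator);
    }

    public static void main(String[] args) {
        // A small sample document containing four newlines.
        String doc = "<root>\n  <a/>\n  <b/>\n</root>\n";

        String lf = withSeparator(doc, "\n");      // Linux-style endings
        String crlf = withSeparator(doc, "\r\n");  // Windows-style endings

        // The CRLF form is longer by exactly one character per line ending.
        System.out.println("LF length:   " + lf.length());
        System.out.println("CRLF length: " + crlf.length());
        System.out.println("Difference:  " + (crlf.length() - lf.length()));

        // The separator a platform-dependent serializer would pick up here.
        String sep = System.getProperty("line.separator");
        System.out.println("line.separator length on this platform: " + sep.length());
    }
}
```

The same arithmetic is why a test that asserts an exact character count for a large document can pass on one platform and fail on the other.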

My question, then, is this: Is it considered a "bug" in Derby if insertion of the same XML value by the user can lead to different data (namely, line ending characters) being written to disk for different platforms?

There appear to be two obvious ways to get around this problem: 1) add logic in Derby engine to take the result of Xalan serialization and replace platform-specific line-endings with "\n", or 2) change the XML binding test to always count line-endings as a single "character" for the sake of asserting character counts.
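
Both options can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not Derby code: the class LineEndingOptions and both method names are hypothetical, and the sketch assumes that any raw "\r" in the serialized string is part of a line ending rather than document data.

```java
final class LineEndingOptions {

    /**
     * Option 1 (sketch): normalize serializer output so that every line
     * ending is a single "\n", whatever the platform produced.
     * Assumes raw '\r' characters only ever appear as part of a line ending.
     */
    static String normalizeToLf(String serialized) {
        // Handle the two-character Windows ending first, then any lone CR.
        return serialized.replace("\r\n", "\n").replace('\r', '\n');
    }

    /**
     * Option 2 (sketch): count characters for a test assertion, treating
     * a "\r\n" pair as one "character".
     */
    static int countWithSingleLineEndings(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            boolean crBeforeLf = text.charAt(i) == '\r'
                    && i + 1 < text.length()
                    && text.charAt(i + 1) == '\n';
            if (crBeforeLf) {
                continue; // skip the CR; the following LF is counted instead
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String windowsStyle = "<a>\r\n</a>\r\n";
        String unixStyle = "<a>\n</a>\n";

        // Option 1: both platforms end up with identical stored text.
        System.out.println(normalizeToLf(windowsStyle).equals(unixStyle));

        // Option 2: both platforms report the same count in the test.
        System.out.println(countWithSingleLineEndings(windowsStyle)
                == countWithSingleLineEndings(unixStyle));
    }
}
```

The practical difference is where the fix lives: the first method changes what is written to disk, while the second only changes how the test measures it.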

I'm leaning toward option 1, but am not particularly driven one way or the other. If the answer to my above question is "Yes, it's a bug", then option 1 is clearly the only option; otherwise option 2 makes the test pass and is easy to implement. It does feel a tad like cheating, though...


Comments/feedback are appreciated, if anyone has any.

Thanks,
Army

----

[1]

I searched Jira for this and found a couple of relevant Xalan issues, especially XALANJ-2093 and XALANJ-1701. There is apparently a new property introduced in Xalan 2.7 to allow the user to indicate what should happen with newlines, but that property is non-standard and would require Derby to use Xalan 2.7 in order to build.

Based on comments in the aforementioned XALANJ issues it looks like it is technically valid for Xalan to convert the newlines to platform-specific endings. This seems to agree with the following quote from the w3c page on serialization:


http://www.w3.org/TR/xslt-xquery-serialization/#serdm:

"When outputting a newline character in the instance of the data model, the serializer is free to represent it using any character sequence that will be normalized to a newline character by an XML parser, unless a specific mapping for the newline character is provided in a character map (see 9 Character Maps)."

I don't know what Xalan serialization does with character maps, but there is nothing explicit in Derby to specify use of such maps, so my (admittedly lacking) understanding is that it's okay for Xalan to return platform-specific line-endings when serializing.
