Daniel John Debrunner wrote:
I was thinking more generally in that an XML value may be generated and
thus never have been stored to disk. How it is stored on disk and how
the XML value is serialized using XMLSERIALIZE() are different
operations; it's just an implementation detail of Derby that they are
the same in some instances.
Okay, that makes sense. Sorry for not grasping this earlier.
Would all these operations return the same exact characters to an
application if they represent the same logical value?
XMLSERIALIZE(colvalue originally on linux)
XMLSERIALIZE(colvalue originally on windows)
XMLSERIALIZE(generated XML value from other XML operators)
I'm assuming the following definitions for this question:
- let "colvalue" represent the logical value
- let "colvalue originally on linux" be the result of inserting
<colvalue> on a Linux machine
- let "colvalue originally on windows" be the result of inserting
<colvalue> on a Windows machine
- let "n" be the number of characters (including line breaks) in
<colvalue>.
- let "nl" be the number of line breaks in <colvalue>
If this is correct, then the answer to the question is No, the above three
operations would not return the same exact characters. The result of the first
operation will have (n) characters in it. The result of the second operation
will have one more character ("\r") in it for every line break in "colvalue";
i.e. it will have (n + nl) characters in it. And the result of the third
operation will have (n + nl) characters if executed on Windows, but only (n)
characters if executed on Linux.
Note that once inserted, serialization of a specific row will return the same
characters regardless of whether the XMLSERIALIZE is executed on Windows or
Linux. Or put another way, the result of the first operation will always return
(n) characters, regardless of platform. Similarly, the result of the second
operation will always return (n + nl) characters.
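
To make the (n) versus (n + nl) arithmetic concrete, here is a small standalone sketch. It does not touch Derby at all; the class name, the helper names, and the sample document are all made up for illustration, and it simply assumes that the Windows form of a value differs from the Linux form only in having "\r\n" where the Linux form has "\n".

```java
public class LineBreakCounts {

    // Counts the '\n' characters in s, i.e. the logical line breaks ("nl").
    static int countLineBreaks(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '\n') {
                count++;
            }
        }
        return count;
    }

    // Assumption for illustration: the Windows form is the Linux form with
    // every "\n" replaced by "\r\n".
    static String toWindowsLineBreaks(String s) {
        return s.replace("\n", "\r\n");
    }

    public static void main(String[] args) {
        // Hypothetical sample value, written with Linux line breaks.
        String linuxForm = "<doc>\n  <item>1</item>\n</doc>\n";
        String windowsForm = toWindowsLineBreaks(linuxForm);

        int n = linuxForm.length();
        int nl = countLineBreaks(linuxForm);

        System.out.println("n              = " + n);
        System.out.println("nl             = " + nl);
        System.out.println("Windows length = " + windowsForm.length());
        System.out.println("equals n + nl  = " + (windowsForm.length() == n + nl));
    }
}
```

The last line printed is the point: the Windows form carries exactly one extra character per line break.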
Would it surprise an application to receive different character values
for those expressions?
Good question. I did some searching around on the Xalan/Xerces Jira issues and
the general notion seems to be that XML "output" (which I presume includes the
result of XML serialization) can convert the newline character to the
platform-specific newline. See esp. Joe Kesselman's comments on XALANJ-1137.
This leads me to believe that there is truth to what Bryan Pendleton said in his
reply to the question, namely:
- carefully written XML applications should not be affected by this
If the expectation (as apparently backed by the XML spec) is that "output" can
have platform-specific newlines, then it seems like an application written to
process XML data should not be surprised by this behavior. And that in a way
leads to the next question:
If they are different, does it matter since they are all valid
serializations under SQL/XML?
Presumably no, it does not (or at least, should not) matter. But having said
that, I cannot help but nod in agreement when I read the following:
My gut feeling is that different character values would be confusing to
an application, but it probably depends what the application is doing
with them. Looking at them in notepad would be confusing. :-)
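
As a rough illustration of "it depends what the application is doing with them": an application that compares the serialized strings directly will see a difference, while one that re-parses them should not, because the XML 1.0 spec (section 2.11) requires a parser to normalize "\r\n" to "\n" on input. The sketch below uses only the JDK's built-in JAXP parser; the class and method names are made up for this example and nothing here is Derby code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class LineEndingParseDemo {

    // Parses the given XML text and returns the text content of the root element.
    static String textContentOf(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        // Two hypothetical serializations of the same logical value.
        String linuxForm = "<a>x\ny</a>";
        String windowsForm = "<a>x\r\ny</a>";

        // A raw string comparison sees two different values...
        System.out.println("raw strings equal:  " + linuxForm.equals(windowsForm));

        // ...but after parsing, the line ends have been normalized to "\n".
        System.out.println("parsed text equal:  "
            + textContentOf(linuxForm).equals(textContentOf(windowsForm)));
    }
}
```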
Given that the relevant specs seem to indicate that it is valid to return
platform-specific endings and it is *also* valid to just return "\n", and given
that the latter option strikes me as potentially less confusing to the app, I
tend to lean toward the less confusing option. Of course, a lot of that has
to do with the fact that the latter option is pretty easily implemented in the
code. I made the following addition to the end of the "serializeToString()"
method in SqlXmlUtil.java and was able to get consistent results (i.e. exactly
the same characters) across platforms:
+ String eol = PropertyUtil.getSystemProperty("line.separator");
+ if (eol != null)
+ return sWriter.toString().replaceAll(eol, "\n");
return sWriter.toString();
Downside is a potential performance hit for large XML docs, which may not be
worth it. Note, though, that the implementation as a whole is not very ideal
for large XML documents because it (already) materializes the entire document
into memory. This continues to be a fish for any idle cooks to fry...
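
For anyone who wants to try the end-of-line normalization outside of the engine, here is a minimal standalone sketch of the idea. It is not the actual SqlXmlUtil code: the class and method names are hypothetical, System.getProperty() stands in for Derby's PropertyUtil, and it uses String.replace() (a literal replacement) instead of replaceAll() (which treats its first argument as a regular expression). It also skips the work entirely when the platform separator is already "\n".

```java
public class EolNormalizer {

    // Returns 'serialized' with every occurrence of 'eol' replaced by "\n".
    // Does nothing if 'eol' is unknown or is already "\n".
    static String normalizeEol(String serialized, String eol) {
        if (eol == null || eol.equals("\n")) {
            return serialized;
        }
        return serialized.replace(eol, "\n");
    }

    public static void main(String[] args) {
        // Stand-in for PropertyUtil.getSystemProperty("line.separator").
        String eol = System.getProperty("line.separator");

        // Hypothetical serializer output that uses the platform line separator.
        String serialized = "<doc>" + eol + "  <item>1</item>" + eol + "</doc>";

        String normalized = normalizeEol(serialized, eol);
        System.out.println("contains \\r after normalizing: "
            + (normalized.indexOf('\r') >= 0));
    }
}
```

Note that this still builds a second copy of the string whenever a replacement is needed, so it does nothing to address the large-document concern.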
Thinking a little more, having XMLSERIALIZE() (within a given runtime)
being non-deterministic seems wrong.
When you write "within a given runtime", what is the definition of "runtime"?
Is that a specific JVM instance on a specific machine, or is it "Derby" on a
more general level? Or something else entirely? Is the behavior that I
described above (i.e. different characters depending on which platform
originally inserted <colvalue>) considered non-deterministic?
I find myself agreeing with both Dan and Bryan on this, and for that reason I
tend to believe the following:
(to quote Bryan):
- it's not a bug in Derby that the serialization can differ in
details like this
- carefully written XML applications should not be affected by this
- it is reasonable to adjust the test to avoid hitting this problem.
(and as an additional thought):
- Given that there is at least one potentially simple "enhancement" to
Derby that could resolve the issue within the engine instead of
within the test, it is *also* reasonable--and perhaps preferable--to
make that change in the engine so that we can (hopefully) reduce the
likelihood of confusing applications that use XML data in Derby.
This would also ensure deterministic (so far as I understand it)
behavior of XMLSERIALIZE across platforms.
Any additional thoughts/suggestions/corrections?
Thanks to Dan, Jean, and Bryan for taking the time to reply thus far...
Army