Daniel John Debrunner wrote:
I was thinking more generally in that an XML value may be generated and thus never have been stored to disk. How it is stored on disk and how the XML value is serialized using XMLSERIALIZE() are different operations, it's just an implementation detail of derby that they are the same in some instances.

Okay, that makes sense.  Sorry for not grasping this earlier.

Would all these operations return the same exact characters to an application if they represent the same logical value?

XMLSERIALIZE(colvalue originally on linux)
XMLSERIALIZE(colvalue originally on windows)
XMLSERIALIZE(generated XML value from other XML operators)

I'm assuming the following definitions for this question:

  - let "colvalue" represent the logical value
  - let "colvalue originally on linux" be the result of inserting
    <colvalue> on a Linux machine
  - let "colvalue originally on windows" be the result of inserting
    <colvalue> on a Windows machine
  - let "n" be the number of characters (including line breaks) in
    <colvalue>.
  - let "nl" be the number of line breaks in <colvalue>

If this is correct, then the answer to the question is No, the above three operations would not return the same exact characters. The result of the first operation will have (n) characters in it. The result of the second operation will have one more character ("\r") in it for every line break in "colvalue"; i.e. it will have (n + nl) characters in it. And the result of the third operation will have (n + nl) characters if executed on Windows, but only (n) characters if executed on Linux.
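
To make the (n) versus (n + nl) arithmetic concrete, here is a small sketch in plain Java. It is not Derby code, and the class and method names are made up for illustration. It assumes the value inserted on Linux uses "\n" for each line break and the value inserted on Windows uses "\r\n".

```java
// Hypothetical illustration only; none of these names exist in Derby.
class LineBreakCount {

    // Count the "\n" line breaks (nl) in a value that uses Unix line endings.
    static int countLineBreaks(String value) {
        int nl = 0;
        for (int i = 0; i < value.length(); i++) {
            if (value.charAt(i) == '\n') {
                nl++;
            }
        }
        return nl;
    }

    // Model the Windows form of the same logical value:
    // every "\n" becomes "\r\n", adding one character per line break.
    static String toWindowsForm(String unixValue) {
        return unixValue.replace("\n", "\r\n");
    }

    public static void main(String[] args) {
        String onLinux = "<a>\n  <b/>\n</a>\n";
        String onWindows = toWindowsForm(onLinux);

        int n = onLinux.length();
        int nl = countLineBreaks(onLinux);

        System.out.println("n = " + n + ", nl = " + nl);
        System.out.println("Windows form length = " + onWindows.length());
        System.out.println("equals n + nl: " + (onWindows.length() == n + nl));
    }
}
```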

Note that once inserted, serialization of a specific row will return the same characters regardless of whether the XMLSERIALIZE is executed on Windows or Linux. Or put another way, the result of the first operation will always return (n) characters, regardless of platform. Similarly, the result of the second operation will always return (n + nl) characters.

Would it surprise an application to receive different character values for those expressions?

Good question. I did some searching around on the Xalan/Xerces Jira issues and the general notion seems to be that XML "output" (which I presume includes the result of XML serialization) can convert the newline character to the platform-specific newline. See esp. Joe Kesselman's comments on XALANJ-1137. This leads me to believe that there is truth to what Bryan Pendleton said in his reply to the question, namely:

 - carefully written XML applications should not be affected by this

If the expectation (as apparently backed by the XML spec) is that "output" can have platform-specific newlines, then it seems like an application written to process XML data should not be surprised by this behavior. And that in a way leads to the next question:

If they are different, does it matter since they are all valid serializations under SQL/XML?

Presumably no, it does not (or at least, should not) matter. But having said that, I cannot help but nod in agreement when I read the following:

My gut feeling is that different character values would be confusing to an application, but it probably depends what the application is doing with them. Looking at them in notepad would be confusing. :-)

Given that the relevant specs seem to indicate that it is valid to return platform-specific endings and it is *also* valid to just return "\n", and given that the latter option strikes me as potentially less confusing to the app, I tend to lean toward the less confusing option. Of course, a lot of that has to do with the fact that the latter option is pretty easily implemented in the code. I made the following addition to the end of the "serializeToString()" method in SqlXmlUtil.java and was able to get consistent results (i.e. exactly the same characters) across platforms:

+        String eol = PropertyUtil.getSystemProperty("line.separator");
+        if (eol != null)
+            return sWriter.toString().replaceAll(eol, "\n");
         return sWriter.toString();
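
For anyone who wants to try the idea outside the engine, here is a standalone sketch of the same normalization. It is an assumption-laden illustration, not the actual patch: the class and method names are hypothetical, it reads the separator with System.getProperty instead of Derby's PropertyUtil, and it uses String.replace (literal matching) where the snippet above uses replaceAll (which treats its first argument as a regular expression).

```java
// Hypothetical helper, not part of Derby; shown only to illustrate the idea.
class EolNormalizer {

    // Replace every occurrence of the given line separator with "\n".
    // String.replace matches its arguments literally, so no regex is involved.
    static String normalize(String serialized, String eol) {
        if (eol == null || eol.isEmpty() || eol.equals("\n")) {
            return serialized; // nothing to rewrite
        }
        return serialized.replace(eol, "\n");
    }

    public static void main(String[] args) {
        // Build a document with this platform's separator, then normalize it.
        String eol = System.getProperty("line.separator");
        String doc = "<a>" + eol + "<b/>" + eol + "</a>";
        System.out.println(normalize(doc, eol).equals("<a>\n<b/>\n</a>"));
    }
}
```

Note that this sketch only rewrites the separator it is handed, i.e. that of the platform where the code runs. A value that already contains "\r\n" would pass through unchanged when the separator is "\n".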

Downside is a potential performance hit for large XML docs, which may not be worth it. Note, though, that the implementation as a whole is not very ideal for large XML documents because it (already) materializes the entire document into memory. This continues to be a fish for any idle cooks to fry...

Thinking a little more, having XMLSERIALIZE() (within a given runtime) being non-deterministic seems wrong.

When you write "within a given runtime", what is the definition of "runtime"? Is that a specific JVM instance on a specific machine, or is it "Derby" on a more general level? Or something else entirely? Is the behavior that I described above (i.e. different characters depending on which platform originally inserted <colvalue>) considered non-deterministic?

I find myself agreeing with both Dan and Bryan on this, and for that reason I tend to believe the following:

(to quote Bryan):

  - it's not a bug in Derby that the serialization can differ in
    details like this
  - carefully written XML applications should not be affected by this
  - it is reasonable to adjust the test to avoid hitting this problem.

(and as an additional thought):

  - Given that there is at least one potentially simple "enhancement" to
    Derby that could resolve the issue within the engine instead of
    within the test, it is *also* reasonable--and perhaps preferable--to
    make that change in the engine so that we can (hopefully) reduce the
    likelihood of confusing applications that use XML data in Derby.
    This would also ensure deterministic (so far as I understand it)
    behavior of XMLSERIALIZE across platforms.

Any additional thoughts/suggestions/corrections?

Thanks to Dan, Jean, and Bryan for taking the time to reply thus far...

Army
