Re: [MarkLogic Dev General] Determining Whether Whitespace is In Data as Stored or A Result of Serialization?

Danny Sokolsky Tue, 29 Nov 2011 16:55:39 -0800

Hi Eliot,

There were some changes made in later 4.2 releases to restore the behavior from 
earlier releases.  The serialization is about how it is output, not how it is 
stored, so it should be stored correctly.
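
To illustrate the distinction, here is a minimal sketch that uses only the JDK's 
own DOM and Transformer classes, not MarkLogic or XCC.  The class and method 
names are made up for the example.  It counts whitespace-only text nodes in the 
parsed tree, which does not depend on any output settings, and then serializes 
the same tree with indenting off and on.  The same kind of node count could 
presumably be run against the stored document instead of eyeballing the 
serialized result.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative only: plain JDK XML APIs, no MarkLogic/XCC calls.
public class WhitespaceCheck {

    // Parse a string into a DOM tree; whitespace text nodes are kept as-is.
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Count text nodes that contain nothing but whitespace. This inspects
    // the tree itself, so serialization settings cannot influence it.
    public static int countWhitespaceOnlyTextNodes(Node node) {
        int count = 0;
        if (node.getNodeType() == Node.TEXT_NODE
                && node.getNodeValue().trim().isEmpty()) {
            count++;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            count += countWhitespaceOnlyTextNodes(c);
        }
        return count;
    }

    // Serialize the tree, with or without indenting added on output.
    public static String serialize(Node node, boolean indent) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.setOutputProperty(OutputKeys.INDENT, indent ? "yes" : "no");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<parent><child>text</child><child>text</child></parent>");
        System.out.println("whitespace-only text nodes in tree: "
            + countWhitespaceOnlyTextNodes(doc));
        System.out.println("indent=no:  " + serialize(doc, false));
        System.out.println("indent=yes: " + serialize(doc, true));
        // The tree is unchanged by having been serialized with indenting.
        System.out.println("whitespace-only text nodes afterwards: "
            + countWhitespaceOnlyTextNodes(doc));
    }
}
```

The exact indented layout depends on the JDK version, but the node count is 
taken from the tree and is the same before and after serializing.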


I recommend trying it on the latest 4.2 release (4.2-7 now, I think).  I think 
it will then, by default, behave the same as in 4.1.  In 4.2, there are some 
serialization options you can set at the query level to control this.  In 
MarkLogic 5, you can also control these options' default values at the App 
Server level.

Here is the 4.2 release note item that describes some of these changes:

http://docs.marklogic.com/4.2doc/docapp.xqy#display.xqy?fname=http://pubs/4.2doc/xml/relnotes/chap4.xml%2340996

-Danny

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Eliot Kimber
Sent: Monday, November 28, 2011 3:04 PM
To: [email protected]
Subject: [MarkLogic Dev General] Determining Whether Whitespace is In Data as 
Stored or A Result of Serialization?

I have determined that content loaded through the XccRunner.load() method
has unwanted whitespace not in the original XML when subsequently accessed
from MarkLogic.

I've tested on 4.2-1. Earlier versions do not seem to have this behavior
(although I need to do more testing to confirm--but we certainly would have
noticed it if we had, as from our standpoint it constitutes a data
corruption issue as data being returned from ML is different from what was
given to ML).

I traced the DOM being loaded right to the call of load() and verified by
inspection that there were no whitespace nodes between two particular
elements, e.g., the original source was:

<parent><child>text</child><child>text</child></parent>

Accessing the loaded document using, e.g.:

doc('/foo/bar/mynewdoc.xml')

Results in:

<parent>
  <child>text</child>
  <child>text</child>
   </parent>

(where there are multiple whitespace characters before the <child> start tags
and before the </parent> close tag).

I tried various access routes, including CQ, access via our own product's
calls to the XccRunner API, OxygenXML via WebDAV and direct XQuery (via Xcc)
and get the same result. Some accesses show more indention than others, but
they all have indention.

From what I could find it appears that this is the result of a change in the
default serialization options.
  
My primary question is: how can I determine how the XML is stored in ML
without interference from any serialization options? Assuming ML is not
literally storing the bytes of the XML, I assume I can't just look inside the
forest, but is there a reliable way to see what the original whitespace was?
My first task is to prove that the XML is correct as provided to MarkLogic.

My secondary questions:

1. Is there any way that options on the load() method could affect
whitespace as stored? I didn't see any but I could have missed something.

2. If this is in fact a function of serialization options, where would we
control that in our Java code that uses Xcc to run XQueries? Is it simply a
matter of adding "declare option xdmp:output indent=no;" to our XQuery
modules?

3. Is this default serialization behavior changed in ML 5?

Thanks,

Eliot

-- 
Eliot Kimber
Senior Solutions Architect
"Bringing Strategy, Content, and Technology Together"
Main: 512.554.9368
www.reallysi.com
www.rsuitecms.com

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general