Re: [MarkLogic Dev General] How to force EOL characters when downloading a text file

David Lee Mon, 01 Sep 2014 17:44:32 -0700

Tim, could you show a (shortened if you want, but complete) example of your 
query, how you invoke it, and how you are getting the results?


The behavior you describe is *probably* the serialization of XDM to text, as 
described here:
http://www.w3.org/TR/xquery-30/#id-serialization  (or)
http://www.w3.org/TR/xslt-xquery-serialization/
(rather obtusely, until you learn how to decipher W3C specification documents).

A critical issue is that the conversion of bare "text nodes" to "text" is part 
of the serialization process,
not part of the node construction.   Node construction with multiple children 
does *not* add newlines ...
(for atomic values it adds spaces - see below).
Serialization *may* add newlines, depending on exactly how you are 
constructing your document, where you are sending the output, and what software 
and settings are used to eventually get it to where you see it.

I suspect you are outputting *only* text nodes ... which means the result is a 
"Sequence of Nodes"
and falls under the category of (5.2.7 Serialization Feature),
which is full of "may"s and "must"s and "implementation-defined".
But in general most processors that serialize XDM (the result of an XQuery or 
XSLT) to text are consistent,
and if not told otherwise (via various output method declarations, command line 
overrides, API settings etc.)
do this:

For every item in the result
   Convert that item to a "string" (using the serialization or atomization 
rules for that item)
   Output that string followed by a newline

(It takes about 20 pages to distill to this.) In your case the critical 
part is whether you are producing
a sequence of items or a single item that wraps a sequence.
A sequence will be newline separated.
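
As a rough sketch, that default behavior can be written out in XQuery itself.
This is only an approximation of what a typical processor does for text output,
not what the spec mandates and not necessarily what MarkLogic does internally;
the $result variable is made up for illustration.

```xquery
(: Approximates the default text output: each item in the result is
   converted to a string and followed by a newline. :)
let $result := (text {"a"}, text {"string"}, text {"is"}, text {"here"})
return
  string-join(
    for $item in $result
    return concat(string($item), "&#10;"),
    "")
```

The query itself returns a single string, so the only newlines in its output
are the ones it adds explicitly.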

Why? Because text nodes are treated differently during element construction 
than they are by themselves.
During element construction adjacent text nodes are combined (without any 
separation).
(http://www.w3.org/TR/xquery-30/ , 3.9.1.3 Content,
"Adjacent text nodes in the content sequence are merged into a single text node 
by concatenating their contents, with no intervening blanks. After 
concatenation, any text node whose content is a zero-length string is deleted 
from the content sequence."
)

If you then serialize the element it won't have any extra spaces.
BUT ... if your XQuery produces a sequence of values (strings, dates, nodes, 
whatever)
then each item is individually serialized *during serialization* and, depending 
on the processor, is likely to be
newline separated.

Try this


<e>{
  text {"a"}, text{"string"}, text{"is"} , text{"here"}
}</e>


You should get something like this
<e>astringishere</e>

The exact output may vary depending on various settings, but it will NOT put a 
newline between the text nodes.
The point here is that this is a ONE item result (an element).

Now try this

text {"a"}, text{"string"}, text{"is"} , text{"here"}

You should get something like this:
a
string
is
here


Note: this is a sequence of FOUR items each serialized then followed by a NL.

While you're at it, you might as well discover that you probably don’t need the 
text{} ... which creates *nodes*.
If what you want is just strings, then converting them to nodes is unnecessary, 
even if you want them as a child of an element.    The rules for combining 
multiple strings (or other atomic values) are different, though ..
in this case it follows the element construction rules:
http://www.w3.org/TR/xquery-30/#id-content  (sec 3.9.1.3)
"For each adjacent sequence of one or more atomic values returned by an 
enclosed expression, a new text node is constructed, containing the result of 
casting each atomic value to a string, with a single space character inserted 
between adjacent values.
"

So try this:


<e>{
  "a", "string", "is" , "here"
}</e>

What do you get ?

<e>a string is here</e>

Different! ... and often baffling to people until they figure out what's going 
on.
This holds true even if you extract the text back out of the node.

like:
<e>{ "a", "string", "is" , "here"}</e>/string()
or
<e>{ "a", "string", "is" , "here"}</e>/node()

A way to double check is to count the results ... how many values in the above? 
4?
Nope, 1.
count(<e>{ "a", "string", "is" , "here"}</e>/node())

1

Compare that with the bare sequence (note I had to enclose the sequence 
in () ...)
count(("a", "string", "is" , "here"))

4



Now if you don’t create an element, and do this directly:


"a", "string", "is" , "here"

What do you get?   No element constructor rules, so we're back to the 
serialization of multiple items ..
so you get

a
string
is
here

This is why concat and string-join (and in V7 and later the || operator) make a 
difference.

"a" || "string" || "is" || "here"
concat("a", "string", "is" , "here")
string-join( ("a", "string", "is" , "here") , "" )

All produce 1 item (string) with no separators.
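
That suggests one way to get the CR-LF line ends asked about: build a single
string and put the separators in yourself. A minimal sketch, assuming the lines
are already available as a sequence of strings (the $lines variable and its
values are made up for illustration) and that nothing downstream rewrites the
line ends:

```xquery
(: One item out, with an explicit CR-LF after every line. :)
let $lines := ("first line", "second line")
let $crlf := codepoints-to-string((13, 10))
return
  string-join(
    for $line in $lines
    return concat($line, $crlf),
    "")
```

Because the result is one item, serialization has no item boundaries at which
to add separators of its own.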
Same is true if you get fancy like

string-join(
  for $i in 1 to 1000
   return concat( "a" , "big" , "runon" , "string", "#" , $i ,
     string-join(("these","are","colon","separated" ),":" ) ) , "" )

But stick that in an element instead of string joining and it’s a tad different:

<e>{ for $i in 1 to 1000
   return concat( "a" , "big" , "runon" , "string", "#" , $i ,
     string-join(("these","are","colon","separated" ),":" ) ) }</e>

Or outside an element ...

for $i in 1 to 1000
   return concat( "a" , "big" , "runon" , "string", "#" , $i ,
     string-join(("these","are","colon","separated" ),":" ) )


All different, but once you get the rules it's quite predictable, and maybe even 
sane.
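
The counting trick works for checking these three cases too. A small sketch,
using 1 to 3 and a shortened concat (both made up to keep it readable):

```xquery
(
  (: joined into one string: 1 item :)
  count(string-join(for $i in 1 to 3 return concat("s", $i), "")),
  (: inside an element: the strings merge into one text node, space separated :)
  count(<e>{ for $i in 1 to 3 return concat("s", $i) }</e>/node()),
  (: bare sequence: 3 items, each serialized on its own :)
  count(for $i in 1 to 3 return concat("s", $i))
)
```

That should come back as 1, 1 and 3.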

-----------------------------------------------------------------------------
David Lee
Lead Engineer
MarkLogic Corporation
[email protected]
Phone: +1 812-482-5224
Cell:  +1 812-630-7622
www.marklogic.com

From: [email protected] 
[mailto:[email protected]] On Behalf Of Tim
Sent: Monday, September 01, 2014 12:24 PM
To: 'MarkLogic Developer Discussion'
Subject: [MarkLogic Dev General] How to force EOL characters when downloading a 
text file

Hi Folks,

I am extracting text from an xml file which can be downloaded by a user. The 
file extension is custom. To create the record I basically walk through the XML 
elements and generate the corresponding text, e.g.

                text{"first line"},
                text{"second line"},
                …

When the user downloads the file, Windows-style line endings (CR-LF) are 
required, and I’m trying to determine how to force that: whether it is in the 
content type, the disposition, or merely in the way in which I add linefeeds to 
the generated text, e.g.

                text{"first line"}, "&#x0D;", "&#x0A;",
                text{"second line"}, "&#x0D;", "&#x0A;",
                …

It seems that using the text{""} constructor adds the linefeed character to the 
generated text without explicitly adding CR-LF.

Thanks for any help with this!

Tim M.
