Re: [MarkLogic Dev General] BOM char and UTF-16

David Lee Thu, 09 Feb 2012 10:45:32 -0800

The XCC docs indicate that the BOM is only supported for UTF-8.
I would suggest transcoding the XML file to UTF-8 before uploading.
That could be done 'on the fly' with some creative code using a Reader and 
Writer to a ByteArrayOutputStream, then creating a ByteArrayInputStream from 
the result, or by using a standalone app and transcoding the XML on the 
filesystem.
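
A minimal sketch of the 'on the fly' route in Java, using only java.io and 
java.nio. The class name Utf16ToUtf8 and the method transcode are hypothetical 
names for illustration, not part of XCC. It relies on Java's "UTF-16" charset, 
which reads a leading BOM to pick the byte order and does not pass the BOM 
through to the decoded text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not an XCC class.
final class Utf16ToUtf8 {
    /**
     * Decodes UTF-16 input (a leading BOM selects the byte order and is
     * consumed) and returns a stream over the same text encoded as UTF-8.
     */
    static InputStream transcode(InputStream utf16In) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Reader reader = new InputStreamReader(utf16In, StandardCharsets.UTF_16);
             Writer writer = new OutputStreamWriter(buffer, StandardCharsets.UTF_8)) {
            char[] chunk = new char[8192];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                writer.write(chunk, 0, n);
            }
        } // closing the writer flushes the remaining bytes into the buffer
        return new ByteArrayInputStream(buffer.toByteArray());
    }
}
```

Two caveats: the whole document is buffered in memory, which may not suit very 
large files, and the XML declaration inside the text still says 
encoding="UTF-16", so it may need to be rewritten or removed to match the new 
bytes.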



-----------------------------------------------------------------------------
David Lee
Lead Engineer
MarkLogic Corporation
[email protected]
Phone: +1 650-287-2531
Cell:  +1 812-630-7622
www.marklogic.com

From: [email protected] 
[mailto:[email protected]] On Behalf Of Geert Josten
Sent: Thursday, February 09, 2012 1:32 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] BOM char and UTF-16

Hi Josh,

Your first line doesn't show where the BOM is located. It should be the first 
two bytes of the file. Note: the encoding attribute in the XML declaration 
doesn't guarantee that the file really is written in that encoding, though it 
is usually a strong hint, particularly if the file was written with XML 
tools/libraries. I'm not sure whether MarkLogic handles the BOM well, but I 
thought it did. I believe I have uploaded UTF-8 files with a BOM without 
problems.

But changing the encoding of the file on the fly to match that of the MarkLogic 
app server setting is a good workaround too, I guess.

Kind regards,
Geert

From: [email protected] 
[mailto:[email protected]] On Behalf Of Josh Warner-Burke
Sent: Wednesday, February 08, 2012 22:49
To: [email protected]
Subject: [MarkLogic Dev General] BOM char and UTF-16

I emailed about a week ago about a problem I was having with XCC and large 
files.  I got some very good advice, which said I needed to use 
session.insertContent to get the file in.  I'm done with that conversion but 
am now dealing with the problems resulting from the change.

What I'm looking at right now is a file that is UTF-16 and begins with a 
two-byte BOM (byte order mark), which I have learned is actually relevant in 
telling any string parser/consumer what order the bytes in each pair will be...
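
As a rough illustration of that point, a BOM check on the first bytes of a 
byte[] could look like the Java sketch below. The names BomSniffer and detect 
are hypothetical, not part of XCC, and UTF-32 BOMs are deliberately ignored 
(a UTF-32LE BOM also starts with FF FE).

```java
// Hypothetical helper, not an XCC class.
final class BomSniffer {
    /**
     * Returns the charset name implied by a leading UTF-8 or UTF-16 BOM,
     * or null when the data does not start with one of those BOMs.
     */
    static String detect(byte[] data) {
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return "UTF-16BE"; // high byte first
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return "UTF-16LE"; // low byte first
            }
        }
        return null;
    }
}
```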

I wrote some code that strips out the BOM but it seems to screw the encoding 
up altogether.  I also put in code to set the encoding to UTF-16 in the 
ContentCreateOptions.  Without stripping the BOM, I get this:
Invalid root text "&#255;&#254;" at [uri] line 1
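
The two character references in that message, 255 and 254, correspond to the 
bytes 0xFF 0xFE, which is the little-endian UTF-16 BOM. One plausible reason 
that stripping it makes things worse (an assumption; nothing in this thread 
confirms what the server does) is that a decoder with no BOM has to guess the 
byte order. Java's own "UTF-16" charset, for example, assumes big-endian when 
no BOM is present. The sketch below shows that behavior; the class and method 
names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not an XCC class.
final class StripBomDemo {
    /** Decodes with Java's generic UTF-16 charset: BOM-aware, big-endian if no BOM. */
    static String decodeUtf16(byte[] data) {
        return new String(data, StandardCharsets.UTF_16);
    }

    /** Decodes BOM-less data by naming the little-endian byte order explicitly. */
    static String decodeUtf16Le(byte[] data) {
        return new String(data, StandardCharsets.UTF_16LE);
    }

    /** "<a/>" as little-endian UTF-16, optionally preceded by the FF FE BOM. */
    static byte[] sample(boolean withBom) {
        byte[] body = {'<', 0, 'a', 0, '/', 0, '>', 0};
        if (!withBom) {
            return body;
        }
        byte[] out = new byte[body.length + 2];
        out[0] = (byte) 0xFF;
        out[1] = (byte) 0xFE;
        System.arraycopy(body, 0, out, 2, body.length);
        return out;
    }
}
```

With these helpers, decodeUtf16(sample(true)) yields the original text, 
decodeUtf16(sample(false)) does not (each byte pair is read in the wrong 
order), and decodeUtf16Le(sample(false)) yields it again. So if the BOM is 
stripped, the byte order has to be stated some other way.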

To deal with UTF-16 don't you *need* those BOMs?  What am I missing here?  FYI 
the first line of the file looks like:

<?xml version="1.0" encoding="UTF-16" standalone="yes"?>

So it's clearly UTF-16.

There is some leeway in terms of how I create the Content object to feed to 
insertContent. Currently I'm treating it as a byte[], but I could do string 
conversion, etc., if that's what I need to do.  Any help is appreciated.

--
Josh Warner-Burke
42SIX Solutions
(m): 410-493-4362
(e): [email protected]
http://www.42six.com

