The XCC docs indicate that the BOM is only supported for UTF-8, so I would suggest transcoding the XML file to UTF-8 before uploading. That could be done 'on the fly' with some creative code: use a Reader and a Writer to write to a ByteArrayOutputStream, then create a ByteArrayInputStream from the result. Alternatively, use a standalone app to transcode the XML on the filesystem.
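A minimal sketch of the on-the-fly approach, assuming the input really is UTF-16 with a leading BOM. The class and method names (Utf16ToUtf8, transcode, asUtf8Stream) are made up for illustration and are not part of XCC. It relies on Java's standard "UTF-16" charset, which reads the BOM to pick the byte order when decoding (falling back to big-endian if there is none) and does not pass the BOM through to the decoded text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of XCC: re-encodes UTF-16 bytes as UTF-8.
final class Utf16ToUtf8 {

    private Utf16ToUtf8() {}

    /** Decodes UTF-16 input (byte order taken from the BOM) and returns UTF-8 bytes. */
    static byte[] transcode(InputStream utf16In) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Reader reader = new InputStreamReader(utf16In, StandardCharsets.UTF_16);
             Writer writer = new OutputStreamWriter(buffer, StandardCharsets.UTF_8)) {
            char[] chunk = new char[8192];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                writer.write(chunk, 0, n);
            }
        } // closing the writer flushes any pending characters into the buffer
        return buffer.toByteArray();
    }

    /** Same as transcode, wrapped as a stream for callers that want an InputStream. */
    static InputStream asUtf8Stream(InputStream utf16In) throws IOException {
        return new ByteArrayInputStream(transcode(utf16In));
    }
}
```

The resulting bytes (or stream) can then be used to build the content passed to session.insertContent. Two caveats: this buffers the whole document in memory, which may matter for large files, and the XML declaration inside the document will still say encoding="UTF-16", so it may need to be adjusted or removed if the server objects to the mismatch.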
-----------------------------------------------------------------------------
David Lee
Lead Engineer
MarkLogic Corporation
[email protected]
Phone: +1 650-287-2531
Cell: +1 812-630-7622
www.marklogic.com

From: [email protected] On Behalf Of Geert Josten
Sent: Thursday, February 09, 2012 1:32 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] BOM char and UTF-16

Hi Josh,

Your first line doesn't show where the BOM is located. It should be the first two characters of the first line. Note: the encoding attribute in the XML declaration doesn't guarantee that the file really is written in that encoding, though it is usually a strong hint, particularly if the file was written with XML tools/libraries.

I'm not sure whether MarkLogic handles the BOM well, but I thought it did. I believe I have uploaded UTF-8 files with a BOM without problems. But changing the encoding of the file on the fly to match the MarkLogic app server setting is a good workaround too, I guess.

Kind regards,
Geert

From: [email protected] On Behalf Of Josh Warner-Burke
Sent: Wednesday, February 8, 2012 22:49
To: [email protected]
Subject: [MarkLogic Dev General] BOM char and UTF-16

I emailed about a week ago about a problem I was having with XCC and large files. I got some very good advice which said I needed to use session.insertContent to get the file in.
I'm done with that conversion, but I'm now dealing with the problems that resulted from the change. What I'm looking at right now is a file that is UTF-16 and begins with two BOM characters, which I have learned are actually relevant in telling any string parser/consumer what order the bytes in each pair will be in.

I wrote some code that strips out the BOMs, but it seems to screw the encoding up altogether. I also put in code to set the encoding to UTF-16 in the ContentCreateOptions. Without stripping the BOMs, I get this:

    Invalid root text "ÿþ" at [uri] line 1

To deal with UTF-16, don't you *need* those BOMs? What am I missing here?

FYI, the first line of the files looks like:

    <?xml version="1.0" encoding="UTF-16" standalone="yes"?>

So it's clearly UTF-16.

There is some leeway in terms of how I create the Content object to feed to insertContent. Currently I'm treating it as a byte[], but I could do string conversion etc. if that's what I need to do.

Any help is appreciated.

--
Josh Warner-Burke
42SIX Solutions
(m): 410-493-4362
(e): [email protected]
http://www.42six.com
_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
