Another function you can try for this approach is xdmp:subbinary: http://developer.marklogic.com/pubs/4.0/apidocs/Extension.html#xdmp:subbinary
-Danny

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Geert Josten
Sent: Thursday, March 26, 2009 3:58 AM
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] Importing xml with unpredictable encoding

Hi Danny,

Unfortunately these documents do contain non-utf-8 characters. Just using a common accented e of some kind already breaks the xdmp:document-get. So, that leaves reading it as binary.

I first thought that there were no functions that were able to do anything with binary content, but then I stumbled across the fn:data function. I started fumbling around with tokenize and codepoints-to-string. All this resulted in the following. I bet there are other users interested in this code, since it allows capturing other information from the prolog as well.

----
(: Returns the first $length bytes of a file as a string. The file is read
   as binary, its hex representation is split into hex digits, and each
   pair of digits is turned into one codepoint. :)
declare function local:filesystem-file-head($path as xs:string, $length as xs:integer)
  as xs:string
{
  let $half-bytes :=
    tokenize(
      substring(
        string(data(
          xdmp:document-get($path,
            <options xmlns="xdmp:document-get">
              <format>binary</format>
              <repair>none</repair>
            </options>))),
        1, $length * 2),
      "")
  let $bytes :=
    for $half-byte at $pos in $half-bytes
    where ($pos mod 2) = 0
    return
      xdmp:hex-to-integer(concat($half-byte, $half-bytes[$pos + 1]))
  return
    codepoints-to-string($bytes)
};

(: Returns the XML declaration found in the first 100 bytes, or '' if none. :)
declare function local:filesystem-file-xmldecl($path as xs:string)
  as xs:string
{
  let $prolog := local:filesystem-file-head($path, 100)
  return
    if (contains($prolog, "<?xml") and contains($prolog, "?>")) then
      concat(substring-before(concat("<?xml", substring-after($prolog, "<?xml")), "?>"), "?>")
    else
      ''
};

(: Returns the encoding named in the XML declaration, or "UTF-8" if none. :)
declare function local:filesystem-file-encoding($path as xs:string)
  as xs:string
{
  let $xml-decl := local:filesystem-file-xmldecl($path)
  return
    if (matches($xml-decl, "^<\?.*\sencoding='([^']*)'.*>$")) then
      replace($xml-decl, "^<\?.*\sencoding='([^']*)'.*>$", "$1")
    else if (matches($xml-decl, '<\?.*\sencoding="([^"]*)".*>$')) then
      replace($xml-decl, '<\?.*\sencoding="([^"]*)".*>$', "$1")
    else
      "UTF-8"
};

let $path := "c:\temp\test.xml"
let $xml-decl := local:filesystem-file-xmldecl($path)
let $encoding := local:filesystem-file-encoding($path)
let $response-decl := replace($xml-decl, $encoding, "utf-8")
return (
  xdmp:set-response-content-type("application/xml; charset=utf-8"),
  $response-decl,
  xdmp:document-get($path,
    <options xmlns="xdmp:document-get">
      <format>xml</format>
      <repair>none</repair>
      <encoding>{$encoding}</encoding>
    </options>)
)
----

Create c:\temp\test.xml containing something like:

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes" ?>
<TEST>Fûnny cháràctërs</TEST>

Do make sure to save it with the proper encoding. (Storing as Ansi will work as well.) The code should work out of the box in CQ.

Not the nicest code, I am aware of that (suggestions to enhance this code are welcome), but at least it works.. :-/

Kind regards,
Geert

Drs. G.P.H. Josten
Consultant
http://www.daidalos.nl/

Daidalos BV
Source of Innovation
Hoekeindsehof 1-4
2665 JZ Bleiswijk
Tel.: +31 (0) 10 850 1200
Fax: +31 (0) 10 850 1199
http://www.daidalos.nl/
KvK 27164984

The information sent in or with this email message originates from Daidalos BV and is intended solely for the addressee. If you have received this message unintentionally, we ask you to delete it. No rights can be derived from this message.

> From: [email protected]
> [mailto:[email protected]] On Behalf Of Danny Sokolsky
> Sent: Wednesday, 25 March 2009 23:49
> To: General Mark Logic Developer Discussion
> Subject: RE: [MarkLogic Dev General] Importing xml with unpredictable encoding
>
> It is a bit hacky, but you can try to do an xdmp:document-get
> as text, then peek into the text to grab the encoding, then
> do an xdmp:document-get on it.
> Something like (this is probably not that robust, but it gives the idea):
>
> let $path := "c:/tmp/test.xml"
> let $encoding :=
>   let $text :=
>     xdmp:document-get($path,
>       <options xmlns="xdmp:document-get">
>         <format>text</format>
>         <repair>none</repair>
>       </options>)
>   let $enc := fn:substring-after(
>                 fn:substring-before($text, '"?>'),
>                 'encoding="')
>   return
>     $enc
> return
>   xdmp:document-load($path,
>     <options xmlns="xdmp:document-load">
>       <uri>/mydoc.xml</uri>
>       <format>xml</format>
>       <repair>none</repair>
>       <encoding>{$encoding}</encoding>
>     </options>)
>
> This will not work if there are non-utf-8 characters though,
> as the xdmp:document-get would throw an exception. And yes,
> it would be nice if the server had this capability built in.
> But for now there are lots of ways around it.
>
> -Danny
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Geert Josten
> Sent: Wednesday, March 25, 2009 2:37 PM
> To: General Mark Logic Developer Discussion
> Subject: RE: [MarkLogic Dev General] Importing xml with unpredictable encoding
>
> Hi Danny,
>
> Are there ways to pre-read the document as a string or binary
> (from XQuery), get the encoding from the declaration by using
> straightforward functions, and use that as the value for the
> encoding option to a call to xdmp:document-get to read the
> document with the correct encoding?
>
> I could pre-parse the files outside MarkLogic Server, or rely
> on things like MLJAM, but I would prefer not needing to.
>
> Has it been considered to support the XML declaration for
> this purpose, for instance when xdmp:document-get is
> called without an explicit encoding option? If not, would you
> be willing to consider such an addition? I really think it would
> improve the value.
>
> Kind regards,
> Geert
>
> > -----Original Message-----
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of Danny Sokolsky
> > Sent: Wednesday, 25 March 2009 16:43
> > To: General Mark Logic Developer Discussion
> > Subject: RE: [MarkLogic Dev General] Importing xml with unpredictable encoding
> >
> > Hi Geert,
> >
> > You can specify the encoding with the <encoding> option to
> > xdmp:document-get or xdmp:document-load. You do have to know the
> > encoding though--it will not use an encoding in a header of the
> > document on its own, and will default to UTF-8.
> >
> > -Danny
> >
> > -----Original Message-----
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of Geert Josten
> > Sent: Wednesday, March 25, 2009 6:07 AM
> > To: General Mark Logic Developer Discussion
> > Subject: [MarkLogic Dev General] Importing xml with unpredictable encoding
> >
> > Hi,
> >
> > Is it correct that the MarkLogic built-in functions xdmp:document-load
> > and xdmp:document-get do not respect the encoding specification in the
> > XML declaration? They expect UTF-8 by default and otherwise try to
> > consume the file with the encoding specified in the options. Is there
> > a way to anticipate the encoding in the XML declaration?
> >
> > I tried using something like xdmp:filesystem-file and (rather ugly)
> > parsing the string with string functions, but it chokes with the
> > message that the string contains a bad codepoint
> > (SVC-BAD: ... -- Bad CodepointIterator::_next).
> >
> > Any ideas?
> >
> > Kind regards,
> > Geert

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
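
Following up on the request for suggestions: the two quote-specific branches in local:filesystem-file-encoding could probably be folded into one pattern. What follows is only a sketch, not something from the thread. The name local:declared-encoding is hypothetical, it uses only standard XQuery functions (nothing from xdmp), and it assumes the caller has already extracted the XML declaration as a string, for example with local:filesystem-file-xmldecl. The character class for the encoding name follows the EncName production of the XML 1.0 specification.

```xquery
(: Hypothetical helper, not from the thread: returns the encoding named in
   an XML declaration string, or "UTF-8" when none is declared. :)
declare function local:declared-encoding($xml-decl as xs:string)
  as xs:string
{
  (: Character class content accepting either quote character. :)
  let $quote := concat('"', "'")
  (: The whole declaration must match, so replace() leaves only group 1. :)
  let $pattern :=
    concat("^<\?xml.*\sencoding\s*=\s*[", $quote, "]",
           "([A-Za-z][A-Za-z0-9._\-]*)",
           "[", $quote, "].*\?>$")
  return
    (: The "s" flag lets "." match line breaks inside the declaration. :)
    if (matches($xml-decl, $pattern, "s")) then
      replace($xml-decl, $pattern, "$1", "s")
    else
      "UTF-8"
};

(: Example call using the declaration from the test file above. :)
local:declared-encoding(
  '<?xml version="1.0" encoding="ISO-8859-1" standalone="yes" ?>')
```

With the test declaration this should return ISO-8859-1. An empty string, or a declaration without an encoding pseudo-attribute, falls through to the UTF-8 default, which matches the behaviour of the original function.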
