Hi Danny,
Interesting. I hadn't noticed that function yet. :-)
It does sound more efficient to first limit the data and then call data() and
string(), but xdmp:subbinary won't accept document { binary { ... } } (the
value returned by xdmp:document-get), so I have to add a /node() step to get
rid of the wrapping document node. I have the impression that sticking to
substring performs slightly better for this reason, but that is difficult to
measure accurately.
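
For reference, a minimal sketch of the xdmp:subbinary variant described above
could look like the following. The function name
local:filesystem-file-head-subbinary is made up for illustration, the start
offset of 1 assumes that xdmp:subbinary counts bytes from 1, and the behaviour
for files shorter than $length has not been checked.

```xquery
xquery version "1.0-ml";

(: Sketch only: same idea as local:filesystem-file-head, but the binary is
   cut with xdmp:subbinary before it is converted to a hex string. :)
declare function local:filesystem-file-head-subbinary(
  $path as xs:string,
  $length as xs:integer
) as xs:string
{
  (: xdmp:document-get returns document { binary { ... } }; the /node()
     step selects the binary node that xdmp:subbinary expects :)
  let $binary :=
    xdmp:document-get($path,
      <options xmlns="xdmp:document-get">
        <format>binary</format>
        <repair>none</repair>
      </options>)/node()
  (: keep the first $length bytes (assumption: offsets start at 1),
     then take the hex representation, two hex digits per byte :)
  let $hex := string(data(xdmp:subbinary($binary, 1, $length)))
  (: decode two hex digits at a time into one codepoint per byte :)
  let $bytes :=
    for $i in 1 to string-length($hex) idiv 2
    return xdmp:hex-to-integer(substring($hex, 2 * $i - 1, 2))
  return codepoints-to-string($bytes)
};

local:filesystem-file-head-subbinary("c:\temp\test.xml", 100)
```

Like the substring version, this treats every byte as a codepoint of its own,
which is good enough for reading an ASCII-compatible XML declaration but not
for decoding the rest of the file.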
Kind regards,
Geert
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of
> Danny Sokolsky
> Sent: Thursday, 26 March 2009 18:35
> To: General Mark Logic Developer Discussion
> Subject: RE: [MarkLogic Dev General] Importing xml with
> unpredictable encoding
>
> Another function you can try for this approach is xdmp:subbinary:
>
> http://developer.marklogic.com/pubs/4.0/apidocs/Extension.html#xdmp:subbinary
>
> -Danny
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of
> Geert Josten
> Sent: Thursday, March 26, 2009 3:58 AM
> To: General Mark Logic Developer Discussion
> Subject: RE: [MarkLogic Dev General] Importing xml with
> unpredictable encoding
>
> Hi Danny,
>
> Unfortunately these documents do contain non-UTF-8
> characters. Just using a common accented e of some kind
> already breaks the xdmp:document-get. So, that leaves reading
> it as binary.
>
>
> I first thought that there were no functions that were able
> to do anything with binary content, but then I stumbled
> across the fn:data function. I started fumbling around with
> tokenize and codepoints-to-string. All this resulted in the
> following. I bet there are other users interested in this
> code, since it allows capturing other information from the
> prolog as well.
>
> ----
>
> declare function local:filesystem-file-head($path as xs:string,
>     $length as xs:integer) as xs:string {
>   let $half-bytes :=
>     tokenize(substring(string(data(xdmp:document-get($path,
>       <options xmlns="xdmp:document-get">
>         <format>binary</format>
>         <repair>none</repair>
>       </options>))), 1, $length * 2), "")
>   let $bytes :=
>     for $half-byte at $pos in $half-bytes
>     where ($pos mod 2) = 0
>     return
>       xdmp:hex-to-integer(concat($half-byte, $half-bytes[$pos + 1]))
>   return
>     codepoints-to-string($bytes)
> };
>
> declare function local:filesystem-file-xmldecl($path as xs:string)
>     as xs:string {
>   let $prolog :=
>     local:filesystem-file-head($path, 100)
>   return
>     if (contains($prolog, "<?xml") and contains($prolog, "?>")) then
>       concat(substring-before(concat("<?xml",
>         substring-after($prolog, "<?xml")), "?>"), "?>")
>     else
>       ''
> };
>
> declare function local:filesystem-file-encoding($path as xs:string)
>     as xs:string {
>   let $xml-decl :=
>     local:filesystem-file-xmldecl($path)
>   return
>     if (matches($xml-decl, "^<\?.*\sencoding='([^']*)'.*>$")) then
>       replace($xml-decl, "^<\?.*\sencoding='([^']*)'.*>$", "$1")
>     else if (matches($xml-decl, '<\?.*\sencoding="([^"]*)".*>$')) then
>       replace($xml-decl, '<\?.*\sencoding="([^"]*)".*>$', "$1")
>     else
>       "UTF-8"
> };
>
> let $path := "c:\temp\test.xml"
> let $xml-decl := local:filesystem-file-xmldecl($path)
> let $encoding := local:filesystem-file-encoding($path)
> let $response-decl := replace($xml-decl, $encoding, "utf-8")
> return (
>   xdmp:set-response-content-type("application/xml; charset=utf-8"),
>   $response-decl,
>   xdmp:document-get($path,
>     <options xmlns="xdmp:document-get">
>       <format>xml</format>
>       <repair>none</repair>
>       <encoding>{$encoding}</encoding>
>     </options>)
> )
>
> ----
>
> Create c:\temp\test.xml containing something like:
>
> <?xml version="1.0" encoding="ISO-8859-1" standalone="yes" ?>
> <TEST>Fûnny cháràctërs</TEST>
>
> Do make sure to save it with the proper encoding. (Saving it as
> ANSI works as well.) The code should work out of the box in CQ.
>
>
> Not the nicest code, I am aware of that (suggestions to
> enhance this code are welcome), but at least it works.. :-/
>
>
> Kind regards,
> Geert
>
>
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of Danny
> > Sokolsky
> > Sent: Wednesday, 25 March 2009 23:49
> > To: General Mark Logic Developer Discussion
> > Subject: RE: [MarkLogic Dev General] Importing xml with unpredictable
> > encoding
> >
> > It is a bit hacky, but you can try to do an xdmp:document-get as text,
> > then peek into the text to grab the encoding, then do an
> > xdmp:document-load with that encoding. Something like (this is probably
> > not that robust, but it gives the idea):
> >
> > let $path := "c:/tmp/test.xml"
> > let $encoding :=
> >   let $text :=
> >     xdmp:document-get($path,
> >       <options xmlns="xdmp:document-get">
> >         <format>text</format>
> >         <repair>none</repair>
> >       </options>)
> >   let $enc := fn:substring-after(
> >     fn:substring-before($text, '"?>'),
> >     'encoding="')
> >   return
> >     $enc
> > return
> >   xdmp:document-load($path,
> >     <options xmlns="xdmp:document-load">
> >       <uri>/mydoc.xml</uri>
> >       <format>xml</format>
> >       <repair>none</repair>
> >       <encoding>{$encoding}</encoding>
> >     </options>)
> >
> >
> > This will not work if there are non-utf-8 characters though, as the
> > xdmp:document-get would throw an exception. And yes, it would be nice
> > if the server had this capability built in.
> > But for now there are lots of ways around it.
> >
> > -Danny
> >
> > -----Original Message-----
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of Geert
> > Josten
> > Sent: Wednesday, March 25, 2009 2:37 PM
> > To: General Mark Logic Developer Discussion
> > Subject: RE: [MarkLogic Dev General] Importing xml with unpredictable
> > encoding
> >
> > Hi Danny,
> >
> > Are there ways to pre-read the document as a string or binary (from
> > XQuery), get the encoding from the declaration by using
> > straightforward functions, and use that as the value for the encoding
> > option to a call to xdmp:document-get to read the document with the
> > correct encoding?
> >
> > I could pre-parse the files outside MarkLogic Server, or rely on
> > things like MLJAM, but I would prefer not needing to.
> >
> > Has supporting the XML declaration for this purpose been considered,
> > for instance when xdmp:document-get is called without an explicit
> > encoding option? If not, would you be willing to consider such an
> > addition? I really think it would improve the value.
> >
> > Kind regards,
> > Geert
> >
> > > -----Original Message-----
> > > From: [email protected]
> > > [mailto:[email protected]] On Behalf Of Danny
> > > Sokolsky
> > > Sent: Wednesday, 25 March 2009 16:43
> > > To: General Mark Logic Developer Discussion
> > > Subject: RE: [MarkLogic Dev General] Importing xml with unpredictable
> > > encoding
> > >
> > > Hi Geert,
> > >
> > > You can specify the encoding with the <encoding> option to
> > > xdmp:document-get or xdmp:document-load. You do have to know the
> > > encoding though--it will not use an encoding in a header of the
> > > document on its own, and will default to UTF-8.
> > >
> > > -Danny
> > >
> > > -----Original Message-----
> > > From: [email protected]
> > > [mailto:[email protected]] On Behalf Of Geert
> > > Josten
> > > Sent: Wednesday, March 25, 2009 6:07 AM
> > > To: General Mark Logic Developer Discussion
> > > Subject: [MarkLogic Dev General] Importing xml with unpredictable
> > > encoding
> > >
> > > Hi,
> > >
> > > Is it correct that the MarkLogic built-in functions xdmp:document-load
> > > and xdmp:document-get do not respect the encoding specification in the
> > > XML declaration? They expect UTF-8 by default and otherwise try to
> > > consume the file with the encoding specified in the options. Is there
> > > a way to anticipate the encoding in the XML declaration?
> > >
> > > I tried using something like xdmp:filesystem-file and (rather ugly)
> > > parsing the string with string functions, but it chokes with the
> > > message that the string contains a bad codepoint (SVC-BAD: ... --
> > > Bad CodepointIterator::_next).
> > >
> > > Any ideas?
> > >
> > > Kind regards,
> > > Geert
>
>
> _______________________________________________
> General mailing list
> [email protected]
> http://xqzone.com/mailman/listinfo/general