RE: [MarkLogic Dev General] Importing xml with unpredictable encoding

Geert Josten Thu, 26 Mar 2009 11:48:53 -0700

Hi Danny,

Interesting. I hadn't noticed that function yet. :-)


It does sound more efficient to limit the data first and only then call
data() and string(). However, xdmp:subbinary won't accept
document { binary { ... } } (the value returned by xdmp:document-get), so I
have to add a /node() step to get rid of the wrapping document node. My
impression is that sticking with substring performs slightly better for this
reason, but it is difficult to measure that accurately.
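
A minimal sketch of the xdmp:subbinary variant described above. It assumes
the signature xdmp:subbinary($source, $starting-location, $length) with a
1-based start offset, as given in the API documentation linked in the quoted
message. The function name local:filesystem-file-head-hex is hypothetical,
and the sketch has not been benchmarked against the substring approach:

```xquery
(: Hypothetical helper for illustration: returns the first $length bytes
   of a file on the filesystem as a string of hex digits. :)
declare function local:filesystem-file-head-hex($path as xs:string, $length as xs:integer) as xs:string {
    let $doc :=
        xdmp:document-get($path,
            <options xmlns="xdmp:document-get">
                <format>binary</format>
                <repair>none</repair>
            </options>)
    (: xdmp:subbinary wants a binary() node, so step past the wrapping
       document node with /node(). Assumption: the offset is 1-based. :)
    let $head := xdmp:subbinary($doc/node(), 1, $length)
    return
        (: data() on a binary node gives its hex value; string() turns
           that into the hex digit string the tokenize step works on :)
        string(data($head))
};

local:filesystem-file-head-hex("c:\temp\test.xml", 100)
```

The result would take the place of the substring(string(data(...)), 1,
$length * 2) expression in the code quoted below, so that only the first
$length bytes are converted to a string.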

Kind regards,
Geert

> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of 
> Danny Sokolsky
> Sent: Thursday, 26 March 2009 18:35
> To: General Mark Logic Developer Discussion
> Subject: RE: [MarkLogic Dev General] Importing xml with 
> unpredictable encoding
> 
> Another function you can try for this approach is xdmp:subbinary:
> 
> http://developer.marklogic.com/pubs/4.0/apidocs/Extension.html
> #xdmp:subbinary
> 
> -Danny
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of 
> Geert Josten
> Sent: Thursday, March 26, 2009 3:58 AM
> To: General Mark Logic Developer Discussion
> Subject: RE: [MarkLogic Dev General] Importing xml with 
> unpredictable encoding
> 
> Hi Danny,
> 
> Unfortunately these documents do contain non-utf-8 
> characters. Just using a common accented e of some kind 
> already breaks the xdmp:document-get. So, that leaves reading 
> it as binary..
> 
> 
> I first thought that there were no functions that were able 
> to do anything with binary content, but then I stumbled 
> across the fn:data function. I started fumbling around with 
> tokenize and codepoints-to-string. All this resulted in the 
> following. I bet there are other users interested in this 
> code, since it allows capturing other information from the 
> prolog as well.
> 
> ----
> 
> declare function local:filesystem-file-head($path as xs:string, $length as xs:integer) as xs:string {
>     (: Read the file as binary and take the first $length bytes of its
>        hex string representation, two hex digits per byte :)
>     let $half-bytes :=
>         tokenize(
>             substring(
>                 string(data(xdmp:document-get($path,
>                     <options xmlns="xdmp:document-get">
>                         <format>binary</format>
>                         <repair>none</repair>
>                     </options>))),
>                 1, $length * 2),
>             "")
>     (: Combine pairs of hex digits into integer byte values :)
>     let $bytes :=
>         for $half-byte at $pos in $half-bytes
>         where ($pos mod 2) = 0
>         return
>             xdmp:hex-to-integer(concat($half-byte, $half-bytes[$pos + 1]))
>     return
>         codepoints-to-string($bytes)
> };
> 
> declare function local:filesystem-file-xmldecl($path as xs:string) as xs:string {
>     let $prolog :=
>         local:filesystem-file-head($path, 100)
>     return
>         if (contains($prolog, "<?xml") and contains($prolog, "?>")) then
>             concat(substring-before(concat("<?xml", substring-after($prolog, "<?xml")), "?>"), "?>")
>         else
>             ''
> };
> 
> declare function local:filesystem-file-encoding($path as xs:string) as xs:string {
>     let $xml-decl :=
>         local:filesystem-file-xmldecl($path)
>     return
>     if (matches($xml-decl, "^<\?.*\sencoding='([^']*)'.*>$")) then
>         replace($xml-decl, "^<\?.*\sencoding='([^']*)'.*>$", "$1")
>     else if (matches($xml-decl, '<\?.*\sencoding="([^"]*)".*>$')) then
>         replace($xml-decl, '<\?.*\sencoding="([^"]*)".*>$', "$1")
>     else
>         "UTF-8"
> };
> 
> let $path := "c:\temp\test.xml"
> let $xml-decl := local:filesystem-file-xmldecl($path)
> let $encoding := local:filesystem-file-encoding($path)
> let $response-decl := replace($xml-decl, $encoding, "utf-8")
> return (
>     xdmp:set-response-content-type("application/xml; charset=utf-8"),
>     $response-decl,
>     xdmp:document-get($path,
>         <options xmlns="xdmp:document-get">
>             <format>xml</format>
>             <repair>none</repair>
>             <encoding>{$encoding}</encoding>
>         </options>)
> )
> 
> ----
> 
> Create c:\temp\test.xml containing something like:
> 
> <?xml version="1.0" encoding="ISO-8859-1" standalone="yes" ?> 
> <TEST>Fûnny cháràctërs</TEST>
> 
> Do make sure to save it with the proper encoding (storing it as ANSI
> will work as well). The code should work out of the box in CQ.
> 
> 
> Not the nicest code, I am aware of that (suggestions to 
> enhance this code are welcome), but at least it works.. :-/
> 
> 
> Kind regards,
> Geert
> 
> >
> 
> 
> Drs. G.P.H. Josten
> Consultant
> 
> 
> http://www.daidalos.nl/
> Daidalos BV
> Source of Innovation
> Hoekeindsehof 1-4
> 2665 JZ Bleiswijk
> Tel.: +31 (0) 10 850 1200
> Fax: +31 (0) 10 850 1199
> http://www.daidalos.nl/
> KvK 27164984
> The information sent in or with this email message originates from
> Daidalos BV and is intended solely for the addressee. If you have
> received this message unintentionally, we ask that you delete it. No
> rights can be derived from this message.
> 
> 
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of Danny 
> > Sokolsky
> > Sent: Wednesday, 25 March 2009 23:49
> > To: General Mark Logic Developer Discussion
> > Subject: RE: [MarkLogic Dev General] Importing xml with unpredictable
> > encoding
> >
> > It is a bit hacky, but you can try to do an xdmp:document-get as text,
> > then peek into the text to grab the encoding, then do an
> > xdmp:document-get on it.  Something like (this is probably not that
> > robust, but it gives the idea):
> >
> > let $path := "c:/tmp/test.xml"
> > let $encoding :=
> >   let $text :=
> >     xdmp:document-get($path,
> >       <options xmlns="xdmp:document-get">
> >         <format>text</format>
> >         <repair>none</repair>
> >       </options>)
> >   let $enc := fn:substring-after(
> >                    fn:substring-before($text, '"?>'),
> >                         'encoding="')
> >   return
> >   $enc
> > return
> > xdmp:document-load($path,
> >  <options xmlns="xdmp:document-load">
> >    <uri>/mydoc.xml</uri>
> >    <format>xml</format>
> >    <repair>none</repair>
> >    <encoding>{$encoding}</encoding>
> >  </options>)
> >
> >
> > This will not work if there are non-utf-8 characters though, as the
> > xdmp:document-get would throw an exception.  And yes, it would be nice
> > if the server had this capability built in.
> > But for now there are lots of ways around it.
> >
> > -Danny
> >
> > -----Original Message-----
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of Geert 
> > Josten
> > Sent: Wednesday, March 25, 2009 2:37 PM
> > To: General Mark Logic Developer Discussion
> > Subject: RE: [MarkLogic Dev General] Importing xml with unpredictable
> > encoding
> >
> > Hi Danny,
> >
> > Are there ways to pre-read the document as a string or binary (from
> > XQuery), get the encoding from the declaration by using
> > straightforward functions, and use that as the value for the encoding
> > option to a call to xdmp:document-get to read the document with the
> > correct encoding?
> >
> > I could pre-parse the files outside MarkLogic Server, or rely on 
> > things like MLJAM, but I would prefer not needing to.
> >
> > Has supporting the XML declaration for this purpose been considered,
> > for instance when xdmp:document-get is called without an explicit
> > encoding option? If not, would you be willing to consider such an
> > addition? I really think it would improve the value.
> >
> > Kind regards,
> > Geert
> >
> > > -----Original Message-----
> > > From: [email protected]
> > > [mailto:[email protected]] On Behalf Of Danny
> > > Sokolsky
> > > Sent: Wednesday, 25 March 2009 16:43
> > > To: General Mark Logic Developer Discussion
> > > Subject: RE: [MarkLogic Dev General] Importing xml with unpredictable
> > > encoding
> > >
> > > Hi Geert,
> > >
> > > You can specify the encoding with the <encoding> option to 
> > > xdmp:document-get or xdmp:document-load.  You do have to know the 
> > > encoding though--it will not use an encoding in a header of the 
> > > document on its own, and will default to UTF-8.
> > >
> > > -Danny
> > >
> > > -----Original Message-----
> > > From: [email protected]
> > > [mailto:[email protected]] On Behalf Of Geert
> > > Josten
> > > Sent: Wednesday, March 25, 2009 6:07 AM
> > > To: General Mark Logic Developer Discussion
> > > Subject: [MarkLogic Dev General] Importing xml with unpredictable 
> > > encoding
> > >
> > > Hi,
> > >
> > > Is it correct that the MarkLogic built-in functions xdmp:document-load
> > > and xdmp:document-get do not respect the encoding specification in the
> > > XML declaration? They expect UTF-8 by default and otherwise try to
> > > consume the file with the encoding specified in the options. Is there
> > > a way to anticipate the encoding in the XML declaration?
> > >
> > > I tried using something like xdmp:filesystem-file and (rather ugly)
> > > parsing the string with string functions, but it chokes with the
> > > message that the string contains a bad codepoint (SVC-BAD: ... --
> > > Bad CodepointIterator::_next).
> > >
> > > Any ideas?
> > >
> > > Kind regards,
> > > Geert
> 
> 
> _______________________________________________
> General mailing list
> [email protected]
> http://xqzone.com/mailman/listinfo/general
