RE: [MarkLogic Dev General] Importing xml with unpredictable encoding

Geert Josten Thu, 26 Mar 2009 03:58:39 -0700

Hi Danny,

Unfortunately these documents do contain non-UTF-8 characters. A single common 
accented e is already enough to break xdmp:document-get. So that leaves 
reading the file as binary.



I first thought there were no functions that could do anything with binary 
content, but then I stumbled across the fn:data function: atomizing a binary 
node gives its content as a string of hex digits. I started fumbling around 
with string functions and codepoints-to-string, and all this resulted in the 
following. I bet other users will be interested in this code, since it allows 
capturing other information from the prolog as well.

----

declare function local:filesystem-file-head($path as xs:string,
    $length as xs:integer) as xs:string {
    (: Atomizing the binary node yields its content as hex digits, two per byte :)
    let $hex := substring(string(data(xdmp:document-get($path,
          <options xmlns="xdmp:document-get">
            <format>binary</format>
            <repair>none</repair>
          </options>))), 1, $length * 2)
    (: Decode each pair of hex digits into one byte value :)
    let $bytes :=
        for $pos in 1 to string-length($hex) idiv 2
        return
            xdmp:hex-to-integer(substring($hex, 2 * $pos - 1, 2))
    return
        (: Treats every byte as one codepoint, so this only suits
           single-byte prologs such as ASCII or ISO-8859-1 :)
        codepoints-to-string($bytes)
};

declare function local:filesystem-file-xmldecl($path as xs:string)
as xs:string {
    let $prolog :=
        local:filesystem-file-head($path, 100)
    return
        if (contains($prolog, "<?xml") and contains($prolog, "?>")) then
            concat(
                substring-before(
                    concat("<?xml", substring-after($prolog, "<?xml")),
                    "?>"),
                "?>")
        else
            ''
};

declare function local:filesystem-file-encoding($path as xs:string)
as xs:string {
    let $xml-decl :=
        local:filesystem-file-xmldecl($path)
    return
    (: Try a single-quoted value first, then a double-quoted one :)
    if (matches($xml-decl, "^<\?.*\sencoding='([^']*)'.*>$")) then
        replace($xml-decl, "^<\?.*\sencoding='([^']*)'.*>$", "$1")
    else if (matches($xml-decl, '^<\?.*\sencoding="([^"]*)".*>$')) then
        replace($xml-decl, '^<\?.*\sencoding="([^"]*)".*>$', "$1")
    else
        "UTF-8"
};

let $path := "c:\temp\test.xml"
let $xml-decl:= local:filesystem-file-xmldecl($path)
let $encoding := local:filesystem-file-encoding($path)
let $response-decl := replace($xml-decl, $encoding, "utf-8")
return (
    xdmp:set-response-content-type("application/xml; charset=utf-8"),
    $response-decl,
    xdmp:document-get($path,
        <options xmlns="xdmp:document-get">
            <format>xml</format>
            <repair>none</repair>
            <encoding>{$encoding}</encoding>
        </options>)
)

----
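
To look at the byte-decoding step on its own, here is a minimal sketch in plain 
XQuery 1.0 that needs no file and no xdmp functions. The names 
local:hex-pair-to-int and local:hex-to-string are made up for this 
illustration. It assumes the input is an even-length string of hex digits in 
which every byte is a valid XML character when read as a codepoint (true for an 
ASCII or ISO-8859-1 prolog, not for UTF-16).

```xquery
(: Hypothetical helper: turns two hex digits, e.g. "3C", into the integer 60. :)
declare function local:hex-pair-to-int($pair as xs:string) as xs:integer {
    let $digits := string-to-codepoints("0123456789ABCDEF")
    let $cps := string-to-codepoints(upper-case($pair))
    return
        16 * (index-of($digits, $cps[1]) - 1)
           + (index-of($digits, $cps[2]) - 1)
};

(: Hypothetical helper: decodes a hex string byte by byte, one codepoint per byte. :)
declare function local:hex-to-string($hex as xs:string) as xs:string {
    codepoints-to-string(
        for $pos in 1 to string-length($hex) idiv 2
        return
            local:hex-pair-to-int(substring($hex, 2 * $pos - 1, 2)))
};

(: 3C 3F 78 6D 6C are the bytes of "<?xml", so this returns "<?xml" :)
local:hex-to-string("3C3F786D6C")
```

Reading one byte as one codepoint is enough here because the XML declaration 
itself only uses ASCII characters in the common single-byte encodings.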

Create c:\temp\test.xml containing something like:

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes" ?>
<TEST>Fûnny cháràctërs</TEST>

Do make sure to save it with the proper encoding (saving it as ANSI will work 
as well). The code should work out of the box in CQ.
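
For a declaration like the one in this test file, the encoding lookup can also 
be tried without touching the filesystem. The following is only a sketch: 
local:encoding-from-decl is a hypothetical name, it takes the declaration as a 
plain string, and it folds both quote styles into a single pattern. Falling 
back to UTF-8 when no encoding is declared is the same assumption the code 
above makes.

```xquery
declare function local:encoding-from-decl($decl as xs:string) as xs:string {
    (: "" inside a double-quoted XQuery string literal is an escaped double
       quote, so [""'] matches either quote character. The "s" flag lets
       . match newlines inside the declaration. :)
    let $pattern :=
        "^<\?xml.*\sencoding\s*=\s*[""']([A-Za-z][A-Za-z0-9._\-]*)[""'].*$"
    return
        if (matches($decl, $pattern, "s")) then
            replace($decl, $pattern, "$1", "s")
        else
            "UTF-8"
};

(: Returns the sequence ("ISO-8859-1", "UTF-8") :)
(
    local:encoding-from-decl(
        '<?xml version="1.0" encoding="ISO-8859-1" standalone="yes" ?>'),
    local:encoding-from-decl("<?xml version='1.0'?>")
)
```

Restricting the captured value to the characters allowed in an XML encoding 
name also keeps regex metacharacters other than "." out of the result, which 
matters if the value is later passed to replace as a pattern.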


Not the nicest code, I am aware of that (suggestions to improve it are 
welcome), but at least it works. :-/


Kind regards,
Geert



Drs. G.P.H. Josten
Consultant


http://www.daidalos.nl/
Daidalos BV
Source of Innovation
Hoekeindsehof 1-4
2665 JZ Bleiswijk
Tel.: +31 (0) 10 850 1200
Fax: +31 (0) 10 850 1199
KvK 27164984


> From: [email protected]
> [mailto:[email protected]] On Behalf Of
> Danny Sokolsky
> Sent: Wednesday, 25 March 2009 23:49
> To: General Mark Logic Developer Discussion
> Subject: RE: [MarkLogic Dev General] Importing xml with
> unpredictable encoding
>
> It is a bit hacky, but you can try to do an xdmp:document-get
> as text, then peek into the text to grab the encoding, then
> do an xdmp:document-get on it.  Something like (this is
> probably not that robust, but it gives the idea):
>
> let $path := "c:/tmp/test.xml"
> let $encoding :=
>   let $text :=
>     xdmp:document-get($path,
>       <options xmlns="xdmp:document-get">
>         <format>text</format>
>         <repair>none</repair>
>       </options>)
>   let $enc := fn:substring-before(
>                    fn:substring-after($text, 'encoding="'),
>                         '"')
>   return
>   $enc
> return
> xdmp:document-load($path,
>  <options xmlns="xdmp:document-load">
>    <uri>/mydoc.xml</uri>
>    <format>xml</format>
>    <repair>none</repair>
>    <encoding>{$encoding}</encoding>
>  </options>)
>
>
> This will not work if there are non-utf-8 characters though,
> as the xdmp:document-get would throw an exception.  And yes,
> it would be nice if the server had this capability built in.
> But for now there are lots of ways around it.
>
> -Danny
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of
> Geert Josten
> Sent: Wednesday, March 25, 2009 2:37 PM
> To: General Mark Logic Developer Discussion
> Subject: RE: [MarkLogic Dev General] Importing xml with
> unpredictable encoding
>
> Hi Danny,
>
> Are there ways to pre-read the document as a string or binary
> (from XQuery), get the encoding from the declaration by using
> straightforward functions, and use that as the value for the
> encoding option to a call to xdmp:document-get to read the
> document with the correct encoding?
>
> I could pre-parse the files outside MarkLogic Server, or rely
> on things like MLJAM, but I would prefer not needing to.
>
> Has it been considered to support the XML declaration for
> this purpose, for instance when xdmp:document-get is
> called without an explicit encoding option? If not, would you
> be willing to consider such an addition? I really think it would
> improve the value.
>
> Kind regards,
> Geert
>
> > -----Original Message-----
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of Danny
> > Sokolsky
> > Sent: Wednesday, 25 March 2009 16:43
> > To: General Mark Logic Developer Discussion
> > Subject: RE: [MarkLogic Dev General] Importing xml with unpredictable
> > encoding
> >
> > Hi Geert,
> >
> > You can specify the encoding with the <encoding> option to
> > xdmp:document-get or xdmp:document-load.  You do have to know the
> > encoding though--it will not use an encoding in a header of the
> > document on its own, and will default to UTF-8.
> >
> > -Danny
> >
> > -----Original Message-----
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of Geert
> > Josten
> > Sent: Wednesday, March 25, 2009 6:07 AM
> > To: General Mark Logic Developer Discussion
> > Subject: [MarkLogic Dev General] Importing xml with unpredictable
> > encoding
> >
> > Hi,
> >
> > Is it correct that the MarkLogic built-in functions xdmp:document-load
> > and xdmp:document-get do not respect the encoding specification in the
> > XML declaration? They expect UTF-8 by default and otherwise try to
> > consume the file with the encoding specified in the options. Is there a
> > way to take the encoding in the XML declaration into account?
> >
> > I tried using something like xdmp:filesystem-file and (rather ugly)
> > parsing the string with string functions, but it chokes with the
> > message that the string contains a bad codepoint
> > (SVC-BAD: ... -- Bad CodepointIterator::_next).
> >
> > Any ideas?
> >
> > Kind regards,
> > Geert


_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
