Hi Danny,
Unfortunately, these documents do contain non-UTF-8 characters. A single
common accented e of some kind already breaks the xdmp:document-get call. So
that leaves reading the file as binary.
At first I thought there were no functions that could do anything with binary
content, but then I stumbled across the fn:data function. I started fumbling
around with tokenize and codepoints-to-string, and all this resulted in the
code below. I bet other users will be interested in it, since it allows
capturing other information from the prolog as well.
----
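(: Reads the first $length bytes of a file and returns them as a string,
   mapping each byte to the code point with the same value. :)
declare function local:filesystem-file-head($path as xs:string, $length as xs:integer) as xs:string {
  (: The typed value of a binary node is xs:hexBinary, so its string
     value holds two hex digits per byte. :)
  let $hex := substring(string(data(xdmp:document-get($path,
      <options xmlns="xdmp:document-get">
        <format>binary</format>
        <repair>none</repair>
      </options>))), 1, $length * 2)
  (: Walk the hex string two digits at a time and convert each pair. :)
  let $bytes :=
    for $pos in 1 to string-length($hex) idiv 2
    return
      xdmp:hex-to-integer(substring($hex, 2 * $pos - 1, 2))
  return
    codepoints-to-string($bytes)
};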
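(: Returns the XML declaration found in the first 100 bytes of the file,
   or the empty string if there is none. :)
declare function local:filesystem-file-xmldecl($path as xs:string) as xs:string {
  let $prolog := local:filesystem-file-head($path, 100)
  (: Everything after "<?xml"; empty if the prolog has no declaration. :)
  let $rest := substring-after($prolog, "<?xml")
  return
    if (contains($rest, "?>")) then
      concat("<?xml", substring-before($rest, "?>"), "?>")
    else
      ''
};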
(: Extracts the encoding name from the XML declaration; falls back to
   UTF-8 when the declaration is missing or names no encoding. :)
declare function local:filesystem-file-encoding($path as xs:string) as xs:string {
  let $xml-decl := local:filesystem-file-xmldecl($path)
  return
    (: The value may be quoted with single or double quotes. :)
    if (matches($xml-decl, "^<\?.*\sencoding='([^']*)'.*>$")) then
      replace($xml-decl, "^<\?.*\sencoding='([^']*)'.*>$", "$1")
    else if (matches($xml-decl, '^<\?.*\sencoding="([^"]*)".*>$')) then
      replace($xml-decl, '^<\?.*\sencoding="([^"]*)".*>$', "$1")
    else
      "UTF-8"
};
let $path := "c:\temp\test.xml"
let $xml-decl := local:filesystem-file-xmldecl($path)
let $encoding := local:filesystem-file-encoding($path)
(: The response is sent as UTF-8, so rewrite the declaration to match. :)
let $response-decl := replace($xml-decl, $encoding, "utf-8")
return (
  xdmp:set-response-content-type("application/xml; charset=utf-8"),
  $response-decl,
  xdmp:document-get($path,
    <options xmlns="xdmp:document-get">
      <format>xml</format>
      <repair>none</repair>
      <encoding>{$encoding}</encoding>
    </options>)
)
----
Create c:\temp\test.xml containing something like:
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes" ?>
<TEST>Fûnny cháràctërs</TEST>
Do make sure to save it with the proper encoding (storing it as ANSI will work
as well). The code should work out of the box in CQ.
Not the nicest code, I am aware of that (suggestions to enhance this code are
welcome), but at least it works. :-/
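To store the file in the database rather than just echo it, the same detected
encoding can presumably be passed to xdmp:document-load, as in Danny's example
quoted below. A minimal sketch, assuming the local: functions above are
declared in the same main module; the URI /mydoc.xml is only a placeholder
taken from that example:

```xquery
(: Assumption: local:filesystem-file-encoding from the listing above is
   declared earlier in this module. :)
let $path := "c:\temp\test.xml"
(: Detect the encoding from the XML declaration, defaulting to UTF-8. :)
let $encoding := local:filesystem-file-encoding($path)
return
  xdmp:document-load($path,
    <options xmlns="xdmp:document-load">
      <uri>/mydoc.xml</uri>
      <format>xml</format>
      <repair>none</repair>
      <encoding>{$encoding}</encoding>
    </options>)
```

This has not been tested beyond the simple ISO-8859-1 case above, so treat it
as a starting point only.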
Kind regards,
Geert
Drs. G.P.H. Josten
Consultant
http://www.daidalos.nl/
Daidalos BV
Source of Innovation
Hoekeindsehof 1-4
2665 JZ Bleiswijk
Tel.: +31 (0) 10 850 1200
Fax: +31 (0) 10 850 1199
http://www.daidalos.nl/
KvK 27164984
> From: [email protected]
> [mailto:[email protected]] On Behalf Of
> Danny Sokolsky
> Sent: Wednesday, March 25, 2009 23:49
> To: General Mark Logic Developer Discussion
> Subject: RE: [MarkLogic Dev General] Importing xml with
> unpredictable encoding
>
> It is a bit hacky, but you can try to do an xdmp:document-get
> as text, then peek into the text to grab the encoding, then
> do an xdmp:document-get on it. Something like (this is
> probably not that robust, but it gives the idea):
>
> let $path := "c:/tmp/test.xml"
> let $encoding :=
>   let $text :=
>     xdmp:document-get($path,
>       <options xmlns="xdmp:document-get">
>         <format>text</format>
>         <repair>none</repair>
>       </options>)
>   let $enc := fn:substring-after(
>     fn:substring-before($text, '"?>'),
>     'encoding="')
>   return
>     $enc
> return
>   xdmp:document-load($path,
>     <options xmlns="xdmp:document-load">
>       <uri>/mydoc.xml</uri>
>       <format>xml</format>
>       <repair>none</repair>
>       <encoding>{$encoding}</encoding>
>     </options>)
>
>
> This will not work if there are non-utf-8 characters though,
> as the xdmp:document-get would throw an exception. And yes,
> it would be nice if the server had this capability built in.
> But for now there are lots of ways around it.
>
> -Danny
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of
> Geert Josten
> Sent: Wednesday, March 25, 2009 2:37 PM
> To: General Mark Logic Developer Discussion
> Subject: RE: [MarkLogic Dev General] Importing xml with
> unpredictable encoding
>
> Hi Danny,
>
> Are there ways to pre-read the document as a string or binary
> (from XQuery), get the encoding from the declaration by using
> straightforward functions, and use that as the value for the
> encoding option to a call to xdmp:document-get to read the
> document with the correct encoding?
>
> I could pre-parse the files outside MarkLogic Server, or rely
> on things like MLJAM, but I would prefer not needing to.
>
> Has it been considered to support the XML declaration for
> this purpose, for instance when xdmp:document-get is
> called without an explicit encoding option? If not, would you
> be willing to consider such an addition? I really think it would
> improve the value.
>
> Kind regards,
> Geert
>
> > -----Original Message-----
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of Danny
> > Sokolsky
> > Sent: Wednesday, March 25, 2009 16:43
> > To: General Mark Logic Developer Discussion
> > Subject: RE: [MarkLogic Dev General] Importing xml with unpredictable
> > encoding
> >
> > Hi Geert,
> >
> > You can specify the encoding with the <encoding> option to
> > xdmp:document-get or xdmp:document-load. You do have to know the
> > encoding though--it will not use an encoding in a header of the
> > document on its own, and will default to UTF-8.
> >
> > -Danny
> >
> > -----Original Message-----
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of Geert
> > Josten
> > Sent: Wednesday, March 25, 2009 6:07 AM
> > To: General Mark Logic Developer Discussion
> > Subject: [MarkLogic Dev General] Importing xml with unpredictable
> > encoding
> >
> > Hi,
> >
> > Is it correct that the MarkLogic built-in functions xdmp:document-load
> > and xdmp:document-get do not respect the encoding specification in the
> > XML declaration? They expect UTF-8 by default and otherwise try to
> > consume the file with the encoding specified in the options. Is there
> > a way to anticipate the encoding in the XML declaration?
> >
> > I tried using something like xdmp:filesystem-file and (rather ugly)
> > parsing the string with string functions, but it chokes with the
> > message that the string contains a bad codepoint
> > (SVC-BAD: ... -- Bad CodepointIterator::_next).
> >
> > Any ideas?
> >
> > Kind regards,
> > Geert