It is a bit hacky, but you can try to do an xdmp:document-get as text,
then peek into the text to grab the encoding from the XML declaration,
then pass that encoding to a second xdmp:document-get (or, as below,
xdmp:document-load). Something like this (it is probably not that
robust, but it gives the idea):

  let $path := "c:/tmp/test.xml"
  let $encoding :=
    let $text :=
      xdmp:document-get($path,
        <options xmlns="xdmp:document-get">
          <format>text</format>
          <repair>none</repair>
        </options>)
    let $enc := fn:substring-after(
                  fn:substring-before($text, '"?>'),
                  'encoding="')
    return
      $enc
  return
    xdmp:document-load($path,
      <options xmlns="xdmp:document-load">
        <uri>/mydoc.xml</uri>
        <format>xml</format>
        <repair>none</repair>
        <encoding>{$encoding}</encoding>
      </options>)

This will not work if the file contains bytes that are not valid UTF-8,
though, because the text-mode xdmp:document-get would throw an
exception. And yes, it would be nice if the server had this capability
built in. But for now there are lots of ways around it.
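
One possible way around the non-UTF-8 caveat, as a rough sketch only: do
the first, text-mode read with an explicit single-byte encoding, so that
every byte decodes to some character, and pull the encoding out of the
declaration with a regular expression. The helper name
local:declared-encoding is made up for illustration, and the sketch
assumes that ISO-8859-1 is accepted by the <encoding> option and that the
declaration itself is plain ASCII (so it will not help with UTF-16
files). It falls back to UTF-8 when no encoding is declared.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper, not a built-in: returns the encoding named in the
   XML declaration at the start of $text, or "UTF-8" if none is found. :)
declare function local:declared-encoding($text as xs:string) as xs:string
{
  (: Everything before the first "?>", i.e. the declaration if present. :)
  let $decl := fn:substring-before($text, "?>")
  (: Accepts single or double quotes and trailing pseudo-attributes
     such as standalone="yes". :)
  let $pattern :=
    '^[^<]*<\?xml\s.*encoding\s*=\s*["'']([A-Za-z][A-Za-z0-9._\-]*)["''].*$'
  return
    if (fn:matches($decl, $pattern, "s"))
    then fn:replace($decl, $pattern, "$1", "s")
    else "UTF-8"
};

let $path := "c:/tmp/test.xml"
(: Assumption: reading as ISO-8859-1 avoids the bad-codepoint error,
   because every byte maps to a character in that encoding. :)
let $text :=
  fn:string(
    xdmp:document-get($path,
      <options xmlns="xdmp:document-get">
        <format>text</format>
        <encoding>ISO-8859-1</encoding>
      </options>))
return
  xdmp:document-load($path,
    <options xmlns="xdmp:document-load">
      <uri>/mydoc.xml</uri>
      <format>xml</format>
      <repair>none</repair>
      <encoding>{local:declared-encoding($text)}</encoding>
    </options>)
```

The regular expression is used instead of the substring calls so that a
declaration with single quotes or a trailing standalone attribute still
yields just the encoding name. I have not tested this against a wide
range of files, so treat it as a starting point.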
-Danny
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Geert
Josten
Sent: Wednesday, March 25, 2009 2:37 PM
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] Importing xml with unpredictable
encoding
Hi Danny,
Are there ways to pre-read the document as a string or binary (from
XQuery), get the encoding from the declaration by using straightforward
functions, and use that as the value for the encoding option to a call
to xdmp:document-get to read the document with the correct encoding?
I could pre-parse the files outside MarkLogic Server, or rely on things
like MLJAM, but I would prefer not needing to.
Has it been considered to support the XML declaration for this
purpose, for instance when xdmp:document-get is called without an
explicit encoding option? If not, would you be willing to consider such
an addition? I really think it would improve the value.
Kind regards,
Geert
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of
> Danny Sokolsky
> Sent: Wednesday, March 25, 2009 4:43 PM
> To: General Mark Logic Developer Discussion
> Subject: RE: [MarkLogic Dev General] Importing xml with
> unpredictable encoding
>
> Hi Geert,
>
> You can specify the encoding with the <encoding> option to
> xdmp:document-get or xdmp:document-load. You do have to know
> the encoding though--it will not use an encoding in a header
> of the document on its own, and will default to UTF-8.
>
> -Danny
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of
> Geert Josten
> Sent: Wednesday, March 25, 2009 6:07 AM
> To: General Mark Logic Developer Discussion
> Subject: [MarkLogic Dev General] Importing xml with
> unpredictable encoding
>
> Hi,
>
> Is it correct that the MarkLogic built-in functions
> xdmp:document-load and xdmp:document-get do not respect the
> encoding specification in the XML declaration? They expect
> UTF-8 by default and otherwise try to consume the file with
> the encoding specified in the options. Is there a way to
> take the encoding in the XML declaration into account?
>
> I tried reading the file with something like xdmp:filesystem-file
> and (rather ugly) parsing the resulting string with string
> functions, but it chokes with the message that the string contains
> a bad codepoint (SVC-BAD: ... -- Bad CodepointIterator::_next).
>
> Any ideas?
>
> Kind regards,
> Geert
>
>
> Drs. G.P.H. Josten
> Consultant
>
>
> http://www.daidalos.nl/
> Daidalos BV
> Source of Innovation
> Hoekeindsehof 1-4
> 2665 JZ Bleiswijk
> Tel.: +31 (0) 10 850 1200
> Fax: +31 (0) 10 850 1199
> http://www.daidalos.nl/
> KvK 27164984
>
>
>
> _______________________________________________
> General mailing list
> [email protected]
> http://xqzone.com/mailman/listinfo/general