It is a bit hacky, but you can try to do an xdmp:document-get as text,
then peek into the text to grab the encoding, then do an
xdmp:document-load with that encoding.  Something like this (it is
probably not that robust, but it gives the idea):
 
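let $path := "c:/tmp/test.xml"
let $encoding :=
  (: read the file as text; with no encoding option it is read as UTF-8 :)
  let $text := 
    xdmp:document-get($path, 
      <options xmlns="xdmp:document-get">
        <format>text</format>
        <repair>none</repair>
      </options>)
  (: look only inside the XML declaration, so that a trailing
     standalone="..." does not end up in the encoding name :)
  let $decl := fn:substring-before($text, '?>')
  let $enc := fn:substring-before(
                fn:substring-after($decl, 'encoding="'), '"')
  (: fall back to UTF-8 when no encoding is declared :)
  return 
  if ($enc eq "") then "UTF-8" else $enc
return 
xdmp:document-load($path, 
 <options xmlns="xdmp:document-load">
   <uri>/mydoc.xml</uri>
   <format>xml</format>
   <repair>none</repair>
   <encoding>{$encoding}</encoding>
 </options>)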
  

This will not work if the file contains bytes that are not valid UTF-8,
though, as the xdmp:document-get would throw an exception.  And yes, it
would be nice if the server had this capability built in.  But for now
there are lots of ways around it.
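
One such way around it, for files that do contain non-UTF-8 bytes, is to
sniff the declaration outside the server before loading.  What follows is
only a rough sketch in Python, not anything MarkLogic ships: the names
sniff_xml_encoding and sniff_file are made up for illustration, and it
assumes an ASCII-compatible encoding (a UTF-16 file would fall through to
the default).  It works on raw bytes, so nothing has to decode cleanly.

```python
import re

# Matches an XML declaration at the very start of the data and captures the
# value of its encoding pseudo-attribute. The pattern works on bytes, so the
# file content is never decoded. [^>]*? keeps the search inside the declaration.
_DECL = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(head: bytes, default: str = "UTF-8") -> str:
    """Return the encoding named in the XML declaration, or `default`."""
    # Skip a UTF-8 byte order mark if there is one.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    match = _DECL.match(head)
    return match.group(1).decode("ascii") if match else default


def sniff_file(path: str) -> str:
    """Hypothetical helper: sniff the encoding of the file at `path`."""
    # The declaration sits at the start of the file, so a small prefix is enough.
    with open(path, "rb") as handle:
        return sniff_xml_encoding(handle.read(256))
```

The returned name could then be passed as the <encoding> option to
xdmp:document-load, whether or not the encoding is named in the file.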

-Danny

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Geert
Josten
Sent: Wednesday, March 25, 2009 2:37 PM
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] Importing xml with unpredictable
encoding

Hi Danny,

Are there ways to pre-read the document as a string or binary (from
XQuery), get the encoding from the declaration by using straightforward
functions, and use that as the value for the encoding option to a call
to xdmp:document-get to read the document with the correct encoding?

I could pre-parse the files outside MarkLogic Server, or rely on things
like MLJAM, but I would prefer not needing to.

Has it been considered to support the XML declaration for this
purpose, for instance when xdmp:document-get is called without an
explicit encoding option? If not, would you be willing to consider such
an addition? I really think it would improve the value.

Kind regards,
Geert

> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of 
> Danny Sokolsky
> Sent: Wednesday, March 25, 2009 16:43
> To: General Mark Logic Developer Discussion
> Subject: RE: [MarkLogic Dev General] Importing xml with 
> unpredictable encoding
> 
> Hi Geert,
> 
> You can specify the encoding with the <encoding> option to 
> xdmp:document-get or xdmp:document-load.  You do have to know 
> the encoding though--it will not use an encoding in a header 
> of the document on its own, and will default to UTF-8.  
> 
> -Danny
> 
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of 
> Geert Josten
> Sent: Wednesday, March 25, 2009 6:07 AM
> To: General Mark Logic Developer Discussion
> Subject: [MarkLogic Dev General] Importing xml with 
> unpredictable encoding
> 
> Hi,
> 
> Is it correct that the MarkLogic built-in functions 
> xdmp:document-load and xdmp:document-get do not respect the 
> encoding specification in the XML declaration? They expect 
> UTF-8 by default and otherwise try to consume the file with 
> the encoding specified in the options. Is there a way to 
> take the encoding in the XML declaration into account?
> 
> I tried using something like xdmp:filesystem-file and then 
> (rather ugly) parsing the string with string functions, but it 
> chokes with the message that the string contains a bad 
> codepoint (SVC-BAD: ... -- Bad CodepointIterator::_next).
> 
> Any ideas?
> 
> Kind regards,
> Geert
> 
> 
> Drs. G.P.H. Josten
> Consultant
> 
> 
> http://www.daidalos.nl/
> Daidalos BV
> Source of Innovation
> Hoekeindsehof 1-4
> 2665 JZ Bleiswijk
> Tel.: +31 (0) 10 850 1200
> Fax: +31 (0) 10 850 1199
> http://www.daidalos.nl/
> KvK 27164984
> 
> 
> 
> _______________________________________________
> General mailing list
> [email protected]
> http://xqzone.com/mailman/listinfo/general