John, I don't think "auto" does what you want. As far as I know it doesn't pay attention to the XML declaration. Rather, it uses the same mechanism as xdmp:encoding-language-detect.
http://docs.marklogic.com/5.0doc/docapp.xqy#display.xqy?fname=http://pubs/5.0doc/xml/dev_guide/loading.xml%2393952

> When you load a document with xdmp:document-load, you can specify the
> <encoding>auto</encoding> option to have MarkLogic Server automatically
> detect the encoding. Encoding detection is not an exact science, as there
> are cases that can be ambiguous, but as long as your document is not too
> small, the encoding detection is fairly accurate. There are, however,
> cases where it might not get the encoding correct. The automatic encoding
> detection chooses an encoding equivalent to the first encoding returned
> by xdmp:encoding-language-detect function.

As the docs say, no encoding detection can be 100% accurate for all possible documents. You may have noticed that web browsers often get it wrong, for example. This is an unavoidable consequence of the past evolution of encoding usage (cf. http://www.joelonsoftware.com/articles/Unicode.html for a good overview of the problem).

Generally speaking I use a triage approach for content with multiple document encodings. I'll try UTF-8 first, and if that blows up I'll try whatever I think is common in the corpus: iso-8859-1 or win-1252 are usually good candidates. If neither works, then I'll shove the document in as binary, with a special collection so that I can review it. This isn't a 100% solution, since some iso-8859-1 documents will decode as UTF-8 without error (but not without loss of fidelity). Similarly, most of the pre-Unicode encodings use much the same set of valid byte values, so it's possible to decode one with the wrong encoder without errors, but with loss of fidelity. There's a rough sketch of this triage below.

If you are confident that your documents all have XML declarations, and that these declarations are correct, then you could integrate that into your triage process. Get each document as text, tokenize by newline, and look at the encoding in the first line. Then supply that value when you get the document as XML (also sketched below). It would be a nice feature if xdmp:document-get et al. could do this for you. But as far as I know they do not.
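To make that triage concrete, here's a rough sketch in 1.0-ml. I haven't run it against your data, and the URIs, the iso-8859-1 second guess, and the "encoding-review" collection name are just placeholders for whatever fits your corpus:

declare function local:get-with-triage($path as xs:string)
  as document-node()
{
  try {
    (: first attempt: assume UTF-8 :)
    xdmp:document-get($path,
      <options xmlns="xdmp:document-get">
        <format>xml</format>
        <encoding>UTF-8</encoding>
      </options>)
  } catch ($e) {
    try {
      (: second attempt: whatever legacy encoding is common in the corpus :)
      xdmp:document-get($path,
        <options xmlns="xdmp:document-get">
          <format>xml</format>
          <encoding>iso-8859-1</encoding>
        </options>)
    } catch ($e2) {
      (: give up: fetch the raw bytes so they can be reviewed later :)
      xdmp:document-get($path,
        <options xmlns="xdmp:document-get">
          <format>binary</format>
        </options>)
    }
  }
};

let $doc := local:get-with-triage('D:\TOC-oe-17-26.xml')
return
  if (xdmp:node-kind($doc/node()) eq "binary")
  then
    (: the binary fall-through goes into a special collection for review :)
    xdmp:document-insert("/review/TOC-oe-17-26.xml", $doc,
      xdmp:default-permissions(), "encoding-review")
  else
    xdmp:document-insert("/loaded/TOC-oe-17-26.xml", $doc)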
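And a rough sketch of the declaration-sniffing idea, with the same caveats; it assumes the declaration, if there is one, sits on the first line:

let $path := 'D:\TOC-oe-17-26.xml'
(: read as text in a single-byte encoding so the get itself can't fail;
   the XML declaration is plain ASCII anyway :)
let $text := xdmp:document-get($path,
  <options xmlns="xdmp:document-get">
    <format>text</format>
    <encoding>iso-8859-1</encoding>
  </options>)
let $first-line := tokenize(string($text), '\n')[1]
(: pull encoding="..." out of the declaration, defaulting to UTF-8 :)
let $declared :=
  if (matches($first-line, 'encoding\s*=\s*["'']([^"'']+)["'']'))
  then replace($first-line, '^.*encoding\s*=\s*["'']([^"'']+)["''].*$', '$1')
  else 'UTF-8'
return
  xdmp:document-get($path,
    <options xmlns="xdmp:document-get">
      <format>xml</format>
      <encoding>{ $declared }</encoding>
    </options>)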
Here's a test, which uses some tricks to simulate the input that xdmp:document-get might see. I'm not sure if it's a fully valid test, but it doesn't seem to use the XML declaration.

xdmp:encoding-language-detect(
  binary {
    xs:hexBinary(
      xs:base64Binary(
        xdmp:base64-encode(
          '<?xml version="1.0" encoding="iso-8859-1"?> <test>this is a test</test>')))})

=>

<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>windows-1252</encoding>
  <language>en</language>
  <score>5.435</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>windows-1252</encoding>
  <language>de</language>
  <score>4.272</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>windows-1252</encoding>
  <language>fr</language>
  <score>4.089</score>
</encoding-language>
[...etc]

You can see that the scores aren't very high, so the detection code isn't very confident. It did detect English, but it thinks the encoding is likely to be windows-1252 (or windows-1250 later in the results). I ran this test on OS X 10.6, using 5.0-1. None of the output mentions iso-8859-1, which I specified in the declaration.

-- Mike

On 3 Nov 2011, at 09:24, John Zhong wrote:

> Hi all,
>
> I want to know how MarkLogic (I am using the 4.2.7 version) determines the
> encoding when the "auto" option is set in the xdmp:document-get function.
> For example:
>
> xdmp:document-get('D:\TOC-oe-17-26.xml',
>   <options xmlns="xdmp:document-get">
>     <encoding>auto</encoding>
>   </options>)
>
> I have many xml files in the file system, but they declare different
> encodings, like <?xml version="1.0" encoding="UTF-8"?> and
> <?xml version="1.0" encoding="iso-8859-1"?>. So I want to use the "auto"
> encoding option to read them without specifying an encoding. When I tested
> reading some UTF-8 xml files with the "auto" option, some characters came
> out messy. For example, it returns "Brückner" (should be "Brückner"),
> but when I set the encoding option to "utf-8", it returns correctly:
>
> <article author="Brückner" fpage="24334" lpage="24341" msid="120315"
>   type="Regular"/>
>
> Thanks,
> John

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
