Re: [MarkLogic Dev General] encoding option "auto" in xdmp:document-get function

John Zhong Thu, 03 Nov 2011 17:45:11 -0700

Thank you, Mike. And yes, that is what I did, as you describe below:

Get each document as text, tokenize by newline, and look at the encoding in
the first line. Then supply that value when you get the document as XML.
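
A rough, untested sketch of that approach follows. The helper name
local:declared-encoding is made up for illustration, the fallback to "utf-8"
is an assumption, and it only works for ASCII-compatible encodings (it would
not find the declaration in a UTF-16 file). The first pass reads the file as
iso-8859-1 text because every byte decodes in that encoding and the XML
declaration itself is plain ASCII.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: return the encoding named in the XML declaration
   on the first line of the file, or "utf-8" if none is declared. :)
declare function local:declared-encoding($path as xs:string) as xs:string
{
  (: First pass: read as text. iso-8859-1 is used only so that every
     byte decodes; the declaration itself is ASCII. :)
  let $text := xdmp:document-get($path,
    <options xmlns="xdmp:document-get">
      <format>text</format>
      <encoding>iso-8859-1</encoding>
    </options>)
  let $first-line := fn:tokenize(fn:string($text), "\n")[1]
  let $pattern := '^.*?encoding\s*=\s*["'']([^"'']+)["''].*$'
  return
    if (fn:matches($first-line, $pattern))
    then fn:replace($first-line, $pattern, '$1')
    else "utf-8"
};

(: Second pass: read as XML with the declared encoding. :)
let $path := "D:\TOC-oe-17-26.xml"
return
  xdmp:document-get($path,
    <options xmlns="xdmp:document-get">
      <encoding>{ local:declared-encoding($path) }</encoding>
    </options>)
```

This trusts the declaration completely, so it is only as good as the
declarations in the files.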


John

On Fri, Nov 4, 2011 at 3:31 AM, Michael Blakeley <[email protected]> wrote:

> John, I don't think "auto" does what you want. As far as I know it doesn't
> pay attention to the XML declaration. Rather, it uses the same mechanism as
> xdmp:encoding-language-detect.
>
>
> http://docs.marklogic.com/5.0doc/docapp.xqy#display.xqy?fname=http://pubs/5.0doc/xml/dev_guide/loading.xml%2393952
>
> > When you load a document with xdmp:document-load, you can specify the
> > <encoding>auto</encoding> option to have MarkLogic Server automatically
> > detect the encoding. Encoding detection is not an exact science, as there
> > are cases that can be ambiguous, but as long as your document is not too
> > small, the encoding detection is fairly accurate. There are, however, cases
> > where it might not get the encoding correct. The automatic encoding
> > detection chooses an encoding equivalent to the first encoding returned by
> > xdmp:encoding-language-detect function.
>
> As the docs say, no encoding detection can be 100% accurate for all
> possible documents. You may have noticed that web browsers often get it
> wrong, for example. This is an unavoidable consequence of the past
> evolution of encoding usage (see
> http://www.joelonsoftware.com/articles/Unicode.html for a good overview
> of the problem).
>
> Generally speaking I use a triage approach for content with multiple
> document encodings. I'll try UTF-8 first, and if that blows up I'll try
> whatever I think is common in the corpus: iso-8859-1 or win-1252 are
> usually good candidates. If neither works, then I'll shove the document in
> as binary, with a special collection so that I can review it. This isn't a
> 100% solution, since some iso-8859-1 documents will decode as UTF-8 without
> error (but not without loss of fidelity). Similarly, most of the
> pre-Unicode encodings use much the same set of valid codepoints, so it's
> possible to decode one with the wrong encoder without errors, but with loss
> of fidelity.
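
The triage described above might look roughly like this. It is an untested
sketch, not Mike's actual code: the helper local:try-get, the database URI,
and the collection name "encoding-review" are all made up for illustration,
and iso-8859-1 stands in for whatever second encoding is common in the corpus.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: try to read $path as XML using $encoding.
   Returns the empty sequence if decoding or parsing raises an error. :)
declare function local:try-get($path as xs:string, $encoding as xs:string)
as node()?
{
  try {
    xdmp:document-get($path,
      <options xmlns="xdmp:document-get">
        <encoding>{ $encoding }</encoding>
      </options>)
  } catch ($e) {
    ()
  }
};

let $path := "D:\TOC-oe-17-26.xml"      (: path from the original question :)
let $uri := "/TOC-oe-17-26.xml"         (: hypothetical database URI :)
let $utf8 := local:try-get($path, "utf-8")
let $doc :=
  if (fn:exists($utf8)) then $utf8
  else local:try-get($path, "iso-8859-1")
return
  if (fn:exists($doc))
  then xdmp:document-insert($uri, $doc)
  else
    (: Neither encoding worked: store the raw bytes for manual review. :)
    xdmp:document-insert($uri,
      xdmp:document-get($path,
        <options xmlns="xdmp:document-get">
          <format>binary</format>
        </options>),
      xdmp:default-permissions(),
      "encoding-review")                (: hypothetical collection name :)
```

As noted above, a document that loads without error may still have been
decoded with the wrong encoding, so this catches only the outright failures.
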
>
> If you are confident that your documents all have XML declarations, and
> that these declarations are correct, then you could integrate that into
> your triage process. Get each document as text, tokenize by newline, and
> look at the encoding in the first line. Then supply that value when you get
> the document as XML.
>
> It would be a nice feature if xdmp:document-get et al. could do this for
> you. But as far as I know they do not. Here's a test, which uses some
> tricks to simulate the input that xdmp:document-get might see. I'm not sure
> if it's a fully valid test, but the detection doesn't seem to use the XML
> declaration.
>
> xdmp:encoding-language-detect(
>  binary {
>    xs:hexBinary(
>      xs:base64Binary(
>        xdmp:base64-encode(
>          "<?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot;?>
> <test>this is a test</test>")))})
> =>
> <encoding-language xmlns="xdmp:encoding-language-detect">
>  <encoding>windows-1252</encoding>
>  <language>en</language>
>  <score>5.435</score>
> </encoding-language>
> <encoding-language xmlns="xdmp:encoding-language-detect">
>  <encoding>windows-1252</encoding>
>  <language>de</language>
>  <score>4.272</score>
> </encoding-language>
> <encoding-language xmlns="xdmp:encoding-language-detect">
>  <encoding>windows-1252</encoding>
>  <language>fr</language>
>  <score>4.089</score>
> </encoding-language>
> [...etc]
>
> You can see that the scores aren't very high, so the detection code isn't
> very confident. It did detect English, but it thinks the encoding is likely
> to be windows-1252 (or windows-1250 later in the results). I ran this test
> on OS X 10.6, using 5.0-1. None of the output mentions iso-8859-1, which I
> specified in the declaration.
>
> -- Mike
>
> On 3 Nov 2011, at 09:24 , John Zhong wrote:
>
> > Hi all,
> >
> > I want to know how MarkLogic (I am using version 4.2.7) determines the
> > encoding when the "auto" option is set in the xdmp:document-get function.
> > For example:
> >
> > xdmp:document-get('D:\TOC-oe-17-26.xml',
> >        <options xmlns="xdmp:document-get">
> >            <encoding>auto</encoding>
> >        </options>)
> >
> > I have many XML files on the file system, but they declare different
> > encodings, such as <?xml version="1.0" encoding="UTF-8"?> and
> > <?xml version="1.0" encoding="iso-8859-1"?>. So I want to use the "auto"
> > encoding option to read them without specifying an encoding. But when I
> > tested reading some UTF-8 XML files with the "auto" option, some characters
> > came out garbled. For example:
> >
> > It returns the element below with "BrÃ¼ckner" (should be "Brückner"). When
> > I set the encoding option to "utf-8", the name is returned correctly.
> > <article author="BrÃ¼ckner" fpage="24334" lpage="24341" msid="120315"
> > type="Regular"/>
> >
> > Thanks,
> > John
> >
> > _______________________________________________
> > General mailing list
> > [email protected]
> > http://developer.marklogic.com/mailman/listinfo/general

