John, I don't think "auto" does what you want. As far as I know it doesn't pay attention to the XML declaration. Rather, it uses the same mechanism as xdmp:encoding-language-detect.
http://docs.marklogic.com/5.0doc/docapp.xqy#display.xqy?fname=http://pubs/5.0doc/xml/dev_guide/loading.xml%2393952

> When you load a document with xdmp:document-load, you can specify the
> <encoding>auto</encoding> option to have MarkLogic Server automatically
> detect the encoding. Encoding detection is not an exact science, as there
> are cases that can be ambiguous, but as long as your document is not too
> small, the encoding detection is fairly accurate. There are, however,
> cases where it might not get the encoding correct. The automatic encoding
> detection chooses an encoding equivalent to the first encoding returned
> by xdmp:encoding-language-detect function.

As the docs say, no encoding detection can be 100% accurate for all possible documents. You may have noticed that web browsers often get it wrong, for example. This is an unavoidable consequence of the past evolution of encoding usage (cf. http://www.joelonsoftware.com/articles/Unicode.html for a good overview of the problem).

Generally speaking I use a triage approach for content with multiple document encodings. I'll try UTF-8 first, and if that blows up I'll try whatever I think is common in the corpus: iso-8859-1 or win-1252 are usually good candidates. If neither works, then I'll shove the document in as binary, with a special collection so that I can review it. This isn't a 100% solution, since some iso-8859-1 documents will decode as UTF-8 without error (but not without loss of fidelity). Similarly, most of the pre-Unicode encodings use much the same set of valid byte values, so it's possible to decode one with the wrong encoder without errors, but with loss of fidelity. There's a rough sketch of this triage below.

If you are confident that your documents all have XML declarations, and that these declarations are correct, then you could integrate that into your triage process. Get each document as text, tokenize by newline, and look at the encoding in the first line. Then supply that value when you get the document as XML (also sketched below). It would be a nice feature if xdmp:document-get et al. could do this for you. But as far as I know they do not.
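To make that triage concrete, here's a rough sketch in 1.0-ml. I haven't run it against your data, and the URIs, the iso-8859-1 second guess, and the "encoding-review" collection name are just placeholders for whatever fits your corpus:

declare function local:get-with-triage($path as xs:string)
  as document-node()
{
  try {
    (: first attempt: assume UTF-8 :)
    xdmp:document-get($path,
      <options xmlns="xdmp:document-get">
        <format>xml</format>
        <encoding>UTF-8</encoding>
      </options>)
  } catch ($e) {
    try {
      (: second attempt: whatever legacy encoding is common in the corpus :)
      xdmp:document-get($path,
        <options xmlns="xdmp:document-get">
          <format>xml</format>
          <encoding>iso-8859-1</encoding>
        </options>)
    } catch ($e2) {
      (: give up: fetch the raw bytes so they can be reviewed later :)
      xdmp:document-get($path,
        <options xmlns="xdmp:document-get">
          <format>binary</format>
        </options>)
    }
  }
};

let $doc := local:get-with-triage('D:\TOC-oe-17-26.xml')
return
  if (xdmp:node-kind($doc/node()) eq "binary")
  then
    (: the binary fall-through goes into a special collection for review :)
    xdmp:document-insert("/review/TOC-oe-17-26.xml", $doc,
      xdmp:default-permissions(), "encoding-review")
  else
    xdmp:document-insert("/loaded/TOC-oe-17-26.xml", $doc)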
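And a rough sketch of the declaration-sniffing idea, with the same caveats; it assumes the declaration, if there is one, sits on the first line:

let $path := 'D:\TOC-oe-17-26.xml'
(: read as text in a single-byte encoding so the get itself can't fail;
   the XML declaration is plain ASCII anyway :)
let $text := xdmp:document-get($path,
  <options xmlns="xdmp:document-get">
    <format>text</format>
    <encoding>iso-8859-1</encoding>
  </options>)
let $first-line := tokenize(string($text), '\n')[1]
(: pull encoding="..." out of the declaration, defaulting to UTF-8 :)
let $declared :=
  if (matches($first-line, 'encoding\s*=\s*["'']([^"'']+)["'']'))
  then replace($first-line, '^.*encoding\s*=\s*["'']([^"'']+)["''].*$', '$1')
  else 'UTF-8'
return
  xdmp:document-get($path,
    <options xmlns="xdmp:document-get">
      <format>xml</format>
      <encoding>{ $declared }</encoding>
    </options>)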
Here's a test, which uses some tricks to simulate the input that xdmp:document-get might see. I'm not sure if it's a fully valid test, but it doesn't seem to use the XML declaration.

xdmp:encoding-language-detect(
  binary {
    xs:hexBinary(
      xs:base64Binary(
        xdmp:base64-encode(
          '<?xml version="1.0" encoding="iso-8859-1"?> <test>this is a test</test>')))})

=>

<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>windows-1252</encoding>
  <language>en</language>
  <score>5.435</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>windows-1252</encoding>
  <language>de</language>
  <score>4.272</score>
</encoding-language>
<encoding-language xmlns="xdmp:encoding-language-detect">
  <encoding>windows-1252</encoding>
  <language>fr</language>
  <score>4.089</score>
</encoding-language>
[...etc]

You can see that the scores aren't very high, so the detection code isn't very confident. It did detect English, but it thinks the encoding is likely to be windows-1252 (or windows-1250 later in the results). I ran this test on OS X 10.6, using 5.0-1. None of the output mentions iso-8859-1, which I specified in the declaration.

-- Mike

On 3 Nov 2011, at 09:24, John Zhong wrote:

> Hi all,
>
> I want to know how MarkLogic (I am using the 4.2.7 version) determines the
> encoding when the "auto" option is set in the xdmp:document-get function.
> For example:
>
> xdmp:document-get('D:\TOC-oe-17-26.xml',
>   <options xmlns="xdmp:document-get">
>     <encoding>auto</encoding>
>   </options>)
>
> I have many xml files in the file system, but they declare different
> encodings, like <?xml version="1.0" encoding="UTF-8"?> and
> <?xml version="1.0" encoding="iso-8859-1"?>. So I want to use the "auto"
> encoding option to read them without specifying an encoding. When I tested
> reading some UTF-8 xml files with the "auto" option, some characters came
> out messy. For example, it returns "Brückner" (should be "Brückner"),
> but when I set the encoding option to "utf-8", it returns correctly:
>
> <article author="Brückner" fpage="24334" lpage="24341" msid="120315"
>   type="Regular"/>
>
> Thanks,
> John

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
