And here is a Windows app to clean up BOMs:

http://www.bryntyounce.com/filebomdetector.htm

wunder
Walter Underwood
Server Engineering, MarkLogic
http://www.marklogic.com/

On Oct 17, 2011, at 10:35 AM, Walter Underwood wrote:

The FEFF sequence (the character U+FEFF) is a "byte order mark" or BOM. In 
UTF-8 it shows up as the three bytes EF BB BF. It is unnecessary in UTF-8, but 
Windows apps will sometimes get clever and add it purely to distinguish between 
UTF-8 text and ASCII text.
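
To check whether a file starts with a BOM before loading it, something like the 
following Python sketch should work. It is only an illustration: the function 
name detect_bom is made up for this example, and it only looks for the UTF-8 
and UTF-16 marks.

```python
import codecs
from typing import Optional

# U+FEFF as it appears in each encoding form. The constants come from
# Python's standard codecs module.
_BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
]


def detect_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

Passing the first few bytes of the file (read in binary mode) is enough, since 
the BOM can only appear at the very start.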

Using a BOM with UTF-8 causes problems right and left, but it is legal, so 
various decoding libraries started supporting it over time. I haven't 
researched the details, but I know that we upgraded the Unicode libraries for 
4.2 and I'm sure that version supports UTF-8+BOM.

More details on UTF-8+BOM: http://en.wikipedia.org/wiki/UTF-8#Byte_order_mark

On Unix, you can use the iconv program to convert between encodings. Converting 
from UTF-8 to UTF-8 doesn't remove the BOM, so you need to launder the text 
through another encoding, like this:

iconv -f UTF-8 -t UTF-16 < test.txt | iconv -f UTF-16 -t UTF-8 > test-no-bom.txt
hexdump -C test-no-bom.txt

The hexdump command shows that the BOM is gone.
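
If iconv is not available (on Windows, for example), the same cleanup can be 
done by dropping the three BOM bytes directly. This is a minimal Python sketch, 
not something tested against MarkLogic; the function names are hypothetical and 
the file names are just the ones from the iconv example.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_file(src: str, dst: str) -> None:
    """Copy src to dst, removing a leading UTF-8 BOM along the way."""
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(strip_utf8_bom(data))


# Example: strip_file("test.txt", "test-no-bom.txt")
```

Working on raw bytes leaves the rest of the file untouched, so nothing is 
re-encoded.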

wunder
Walter Underwood
Server Engineering, MarkLogic
http://www.marklogic.com/

On Oct 17, 2011, at 8:53 AM, Neil wrote:

Hi,

I am getting a weird encoding issue when importing documents.

While my local copy of MarkLogic has NO problem loading a document, when I try 
to load the same document on a remote server the following error is raised:

[1.0-ml] XDMP-DOCROOTTEXT: xdmp:document-load("C:\TestFile.html", (), <options 
xmlns="xdmp:eval"><database>16453038828028925603</database><modules>0</modules><de...</options>)
 -- Invalid root text "&#xfeff;" at C:\TestFile.html line 1

The document loads OK in IE, which reports that the document is “Unicode-UTF-8” 
encoding, and opens OK in Oxygen too. When I open the document in Notepad I do 
not see any unusual characters on line one, but did not really expect to. I 
recall that characters FE and FF are used in Unicode to indicate whether the 
bytes are lower- or higher-byte first (but is that just in UTF-16?).

I tried adding <encoding> options to xdmp:document-load() but none of the 
values I tried helped.

My local configuration is:
Architecture: i686
Platform: winnt
Host: neil-pc
MarkLogic Product Edition: Standard
MarkLogic Product Version: 4.2-5

The server configuration is:
Architecture: amd64
Platform: winnt
Host: dgdbsrv1.dg.local
MarkLogic Product Edition: Enterprise
MarkLogic Product Version: 4.0-3

So the obvious question is whether differences between 4.0 and 4.2 account for 
me not seeing the error locally, and the error appearing on the server?

If so, is the only solution to upgrade the server version? Or is there an easy 
way I can convert the documents into a format that WILL load into the earlier 
version of ML?

Regards,

Neil.




_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general



