tika-user  

UTF-8 text files without BOM Error

Baldwin, David
Thu, 07 Jan 2010 15:29:33 -0800

Using Tika 0.4

I have tracked down an issue and need some help taking it further.  I am 
indexing files (.txt files in the case I am describing here) whose byte-for-byte 
contents have been stored as blobs.  At indexing time, the entire contents 
are passed in as a ByteArrayInputStream.

Using:

Class:  org.apache.tika.parser.AutoDetectParser
Method: void parse(InputStream arg0, ContentHandler arg1, Metadata arg2) throws 
IOException, SAXException, TikaException

The metadata returned indicates it was detected as

"Content-Type=application/octet-stream ".

I decided to run it directly through the TXTParser and it detected it as

"Content-Encoding=ISO-8859-1 Content-Language=fr Content-Type=text/plain 
language=fr ".

Crazy, since the file is not French, and it is UTF-8 encoded, not ISO-8859-1.  
The TXTParser also fails to return much of the file's data, so it clearly does 
not recognize the content.
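
To illustrate why an ISO-8859-1 guess is harmful for UTF-8 data, here is a 
small sketch using only the JDK (no Tika involved).  The class and method names 
are hypothetical, and this does not reproduce the missing-data symptom itself; 
it only shows how multi-byte UTF-8 characters are mangled when decoded as 
ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tika.
final class WrongCharsetDemo {

    // Decode bytes as ISO-8859-1 regardless of their real encoding,
    // which is what a wrong charset guess amounts to.
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "café" in UTF-8 is 63 61 66 C3 A9 (5 bytes for 4 characters).
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        String right = new String(utf8, StandardCharsets.UTF_8);
        String wrong = decodeAsLatin1(utf8);

        System.out.println(right.length()); // 4
        System.out.println(wrong.length()); // 5: C3 and A9 become two characters
    }
}
```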

Simply prepending the BOM (0xEF 0xBB 0xBF) fixes everything: the file is then 
detected by the AutoDetectParser as

"Content-Encoding=UTF-8 Content-Type=text/plain "

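As a stopgap, the BOM workaround could be applied in code before the blob is 
handed to the parser.  The following is a minimal sketch using only the JDK; 
the names Utf8BomWorkaround, withBom and asStream are hypothetical, and it 
assumes the blob is already known to be UTF-8 text (prepending a BOM to data in 
another encoding would be wrong).

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

// Hypothetical helper; not part of Tika.
final class Utf8BomWorkaround {

    // The UTF-8 byte order mark: EF BB BF.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // True if the data already begins with the UTF-8 BOM.
    static boolean startsWithBom(byte[] data) {
        return data.length >= UTF8_BOM.length
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2];
    }

    // Returns the data with a BOM in front, or the same array if one is present.
    static byte[] withBom(byte[] data) {
        if (startsWithBom(data)) {
            return data;
        }
        byte[] out = new byte[UTF8_BOM.length + data.length];
        System.arraycopy(UTF8_BOM, 0, out, 0, UTF8_BOM.length);
        System.arraycopy(data, 0, out, UTF8_BOM.length, data.length);
        return out;
    }

    // Wraps the blob the same way it is passed in today, with the BOM added.
    static InputStream asStream(byte[] blob) {
        return new ByteArrayInputStream(withBom(blob));
    }
}
```

The stream returned by asStream(blob) would then be passed to the parse method 
in place of the original ByteArrayInputStream.
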
The BOM is not required for UTF-8.  Some editors add it, and others do not.  
UTF-8 has no platform-dependent byte order (i.e. big- and little-endian do not 
apply to UTF-8).  See 
http://www.stanford.edu/~laurik/fsmbook/errata/BOM.html as one of many 
references.  So what should I do at this point about the detection in Tika?  I 
believe this may be a defect, but I wish to discuss it first.
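
For discussion, one possible way to check independently of Tika whether a blob 
is valid UTF-8 is to decode it strictly with the JDK.  This is only a sketch 
and not Tika's detection logic; Utf8Check and isValidUtf8 are hypothetical 
names.  Note that pure ASCII also passes, and a few ISO-8859-1 byte sequences 
happen to be valid UTF-8, so a "true" result is strong evidence rather than 
proof.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of Tika.
final class Utf8Check {

    // Returns true if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                // Report errors instead of silently substituting U+FFFD.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A blob that passes this check and has no BOM is exactly the kind of file 
described above.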

Thanks.

David