Baldwin, David
Thu, 07 Jan 2010 15:29:33 -0800
Using Tika 0.4, I have tracked down an issue and need some help taking it further. I am indexing files (.txt files in the case I am describing here) whose byte-for-byte contents have been stored as blobs. When a file is indexed, its entire contents are passed in as a ByteArrayInputStream.
Using:
  Class: org.apache.tika.parser.AutoDetectParser
  Method: void parse(InputStream arg0, ContentHandler arg1, Metadata arg2) throws IOException, SAXException, TikaException

The metadata returned indicates the file was detected as "Content-Type=application/octet-stream ".

I then ran it directly through the TXTParser, which detected it as "Content-Encoding=ISO-8859-1 Content-Language=fr Content-Type=text/plain language=fr ". That is wrong on both counts: the file is not French, and it is UTF-8 encoded, not ISO-8859-1. The parser also fails to return much of the data, so it clearly does not recognize the content.

Simply prepending the BOM (0xEF 0xBB 0xBF) makes everything fine: the AutoDetectParser then detects the file as "Content-Encoding=UTF-8 Content-Type=text/plain ".

The BOM is not required for UTF-8. Some editors write it and others do not. UTF-8 has no platform-dependent byte order (i.e. big-endian versus little-endian does not apply to UTF-8). See http://www.stanford.edu/~laurik/fsmbook/errata/BOM.html as one of many references.

So what should I do at this point about the detection in Tika? I believe this may be a defect, but I wish to discuss it first.

Thanks.
David
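
For reference, the BOM workaround described above could be sketched roughly as follows. This is only an illustration using the standard Java library, not anything from Tika itself; the class name Utf8Bom and its method names are hypothetical. It prepends the three BOM bytes only when the data decodes as strict UTF-8 and does not already start with a BOM. Note that pure ASCII also counts as valid UTF-8 here, so such data would get a BOM as well.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: adds a UTF-8 BOM to data that
// is valid UTF-8 but lacks one.
final class Utf8Bom {
    private static final byte[] BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // True if the data already begins with the bytes 0xEF 0xBB 0xBF.
    static boolean startsWithBom(byte[] data) {
        return data.length >= BOM.length
                && data[0] == BOM[0] && data[1] == BOM[1] && data[2] == BOM[2];
    }

    // True if the whole array decodes as UTF-8 with no malformed sequences.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Returns the same array unchanged if it already has a BOM or is not
    // valid UTF-8; otherwise returns a new array with the BOM prepended.
    static byte[] withBomIfUtf8(byte[] data) {
        if (startsWithBom(data) || !isValidUtf8(data)) {
            return data;
        }
        byte[] out = new byte[BOM.length + data.length];
        System.arraycopy(BOM, 0, out, 0, BOM.length);
        System.arraycopy(data, 0, out, BOM.length, data.length);
        return out;
    }
}
```

The resulting array could then be wrapped in a ByteArrayInputStream and passed to the parse method quoted above. This only works around the symptom on the caller's side; it does not address why detection fails without the BOM.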