tika-user  

UTF-8 Problem in SNAPSHOT-0.5

Wermter, Joachim
Fri, 13 Nov 2009 06:53:56 -0800

Hi,

I'm using the UIMA Tika Annotator, which calls Tika as follows:

Parser.parse(originalStream, handler, md);

originalStream is a BufferedInputStream. I've upgraded the Tika dependencies 
from 0.4 to 0.5-SNAPSHOT, and since then the InputStream is no longer decoded 
correctly as UTF-8 (German umlauts, for example, come out wrong). Was there a 
change in 0.5-SNAPSHOT that affects this?
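
To narrow down a symptom like this, it can help to check whether the broken 
output matches a known wrong decoding. The following is only a minimal sketch 
using the JDK, not Tika itself. It assumes (this is a guess, not something 
confirmed above) that the UTF-8 bytes are being read as ISO-8859-1. The class 
EncodingCheck and its method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {

    // Decode raw bytes with an explicit charset rather than the platform default.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Returns true if 'observed' is what you get when the UTF-8 bytes of
    // 'expected' are (wrongly) decoded as ISO-8859-1. This only tests one
    // assumed failure mode; other charsets would need their own check.
    static boolean looksLikeLatin1Misread(String expected, String observed) {
        byte[] utf8 = expected.getBytes(StandardCharsets.UTF_8);
        return decode(utf8, StandardCharsets.ISO_8859_1).equals(observed);
    }

    public static void main(String[] args) {
        // "München", written with a Unicode escape so the source file's own
        // encoding cannot influence the result.
        String expected = "M\u00fcnchen";
        byte[] utf8 = expected.getBytes(StandardCharsets.UTF_8);

        // Correct decoding: round-trips to the original string.
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(expected));

        // Wrong decoding: the two UTF-8 bytes of the umlaut become two characters.
        String misread = decode(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(looksLikeLatin1Misread(expected, misread));
    }
}
```

If the text extracted by the parser matches the misread form, the bytes are 
intact and only the charset used for decoding is wrong. If it does not match, 
a different charset or an earlier conversion step is the more likely cause.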

Best regards,
Joachim


Siemens AG
Corporate Technology
CT IC 1
Otto-Hahn-Ring 6
81739 München, Deutschland
Tel.: +49 (89) 636-33647
Fax: +49 (89) 636-49438
mailto:joachim.werm...@siemens.com


