Kaspar Fischer
Mon, 15 Feb 2010 07:34:49 -0800
Dear list,
I am trying to convert documents to plaintext using Tika. This is what I do:
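// Collect the extracted body text in memory.
StringWriter writer = new StringWriter();
ContentHandler handler = new BodyContentHandler(writer);
// AutoDetectParser picks a concrete parser based on the detected document type.
Parser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
// dataStream is the InputStream of the document to convert.
parser.parse(dataStream, handler, metadata, new ParseContext());
String plaintext = writer.toString();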
This works like a charm and I also get some metadata (like
metadata.get("Content-Type")) for free.
The only issue I have run into is the encoding of documents. If I pass in a
file in, for example, cp437 encoding, the resulting plaintext still seems to be
in that encoding rather than in UTF-8. I am therefore wondering where and how
the encoding detection must be done.
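To make the symptom concrete, here is a minimal sketch using only the Java standard library, not Tika. It assumes the JDK in use ships the IBM437 charset (Java's name for cp437); the class name DecodeDemo and the sample bytes are made up for illustration. It shows that the same bytes give correct text only when decoded with the right charset, and that UTF-8 only enters the picture when the resulting String is written out as bytes again.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecodeDemo {
    // Decode raw bytes using an explicitly named charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The word "Gruesse" with umlaut and sharp s, as cp437 bytes:
        // u-umlaut is 0x81 and sharp s is 0xE1 in that code page.
        byte[] cp437 = {'G', 'r', (byte) 0x81, (byte) 0xE1, 'e'};

        String right = decode(cp437, "IBM437"); // correct source charset
        String wrong = decode(cp437, "UTF-8");  // wrong guess: invalid bytes become U+FFFD

        System.out.println(right.equals("Gr\u00fc\u00dfe")); // prints true
        System.out.println(wrong.contains("\uFFFD"));        // prints true

        // The String itself has no byte encoding; UTF-8 is chosen when encoding it.
        byte[] utf8 = right.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);                     // prints 7
    }
}
```

If the plaintext looks wrong, the likely culprit under this reading is the charset used for decoding the input, not the output step.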
The "how" in this question is probably answered like this (please correct me if
I am wrong!):
CharsetDetector detector = new CharsetDetector();
detector.setText(new BufferedInputStream(stream));
String encoding = detector.detect().getName();
but I don't know how to tell the parser/handler about the encoding.
I have tried adding the encoding to the input metadata via
metadata.add(Metadata.CONTENT_ENCODING, encoding);
but without success.
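One possible workaround, sketched here with the Java standard library only, is to transcode the input to UTF-8 before handing it to the parser. This is not a Tika API: the class Utf8Transcoder and the method toUtf8 are hypothetical names, and the sketch assumes that the detected encoding name is one Java recognises. It only makes sense for plain-text inputs, since transcoding a binary format would corrupt it, and whether the parser then treats the stream as UTF-8 still depends on its own detection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Utf8Transcoder {
    // Reads `in` as text in `sourceCharset` and returns the same text as a
    // UTF-8 byte stream. Everything is buffered in memory, so this is only
    // suitable for reasonably small inputs. The input stream is closed.
    static InputStream toUtf8(InputStream in, String sourceCharset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Reader reader = new InputStreamReader(in, Charset.forName(sourceCharset));
             Writer writer = new OutputStreamWriter(buffer, StandardCharsets.UTF_8)) {
            char[] chunk = new char[4096];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                writer.write(chunk, 0, n);
            }
        }
        // The writer has been closed (and flushed) by now, so the buffer is complete.
        return new ByteArrayInputStream(buffer.toByteArray());
    }
}
```

With the variables from above, this would be called as Utf8Transcoder.toUtf8(stream, encoding), and the returned stream passed to parser.parse(...) in place of the original one.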
Many thanks for any feedback,
Kaspar