java.io.UnsupportedEncodingException with Russian, Chinese, ... document
------------------------------------------------------------------------
Key: TIKA-517
URL: https://issues.apache.org/jira/browse/TIKA-517
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.7
Environment: Mac OS X, Java 6, Eclipse
Reporter: Dominique Béjean
When I try to extract text from a PDF or DOC document in Russian, Chinese,
Korean, Serbian, ..., I get an error about an unsupported encoding:
org.xml.sax.SAXException: java.io.UnsupportedEncodingException:
    at org.apache.xml.serialize.BaseMarkupSerializer.startDocument(Unknown Source)
    at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
    at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
    at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
    at org.apache.tika.sax.XHTMLContentHandler.startDocument(XHTMLContentHandler.java:93)
...
It works fine with English or other ISO-8859-1 languages.
PDFBox extracts the text correctly, so I assume the problem is not in the
libraries used for text extraction from the various formats, but somewhere
after them.
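The trace fails in the serializer's startDocument(), and the exception message
after "UnsupportedEncodingException:" is empty. One possible explanation (an
assumption, not confirmed by this report) is that the output encoding name
handed to the serializer is empty or unknown to the JVM. A minimal sketch of a
check that could be run on the encoding name before it is passed on follows.
The class EncodingCheck and the method safeEncoding are hypothetical names
used only for illustration; they are not part of Tika or Xerces.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper, not part of Tika: validates an encoding name
// against the charsets known to the running JVM.
public class EncodingCheck {

    // Returns the trimmed requested name if the JVM supports it,
    // otherwise falls back to UTF-8 (assumed here to be an acceptable default).
    static String safeEncoding(String requested) {
        if (requested == null || requested.trim().isEmpty()) {
            return "UTF-8";
        }
        String name = requested.trim();
        try {
            return Charset.isSupported(name) ? name : "UTF-8";
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are illegal in a charset name.
            return "UTF-8";
        }
    }

    public static void main(String[] args) {
        // Print what the check does with a valid, an empty and a missing name.
        System.out.println(safeEncoding("ISO-8859-1"));
        System.out.println(safeEncoding(""));
        System.out.println(safeEncoding(null));
    }
}
```

If the value that reaches the serializer turns out to be empty for these
documents, that would point to the code that determines the output encoding
rather than to the format parsers.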