[ https://issues.apache.org/jira/browse/TIKA-517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12926793#action_12926793 ]
Jukka Zitting commented on TIKA-517:
------------------------------------
The stack trace suggests that the exception is thrown while you're
serializing the output from Tika, so, as Ken said, this doesn't seem to be a
Tika issue. How do you specify the output encoding?
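
For illustration, one way to set the output encoding explicitly is to serialize the SAX events through a JAXP TransformerHandler with OutputKeys.ENCODING set to UTF-8. This is only a sketch under assumptions: it uses JAXP in place of the Xerces org.apache.xml.serialize classes seen in the stack trace, the class name Utf8SaxSerializer and the method serialize are hypothetical, and the hand-written SAX events stand in for whatever a parser would send to the handler.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper: serializes SAX events as XML with an explicit encoding.
class Utf8SaxSerializer {

    static String serialize(String text) throws Exception {
        // The JDK's default TransformerFactory also implements SAXTransformerFactory.
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();

        // State the output encoding explicitly instead of relying on a default.
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");

        // Write to a byte stream so the serializer itself does the encoding.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        handler.setResult(new StreamResult(bytes));

        // Stand-in for the events a parser would emit into this handler.
        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        char[] chars = text.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "p", "p");
        handler.endDocument();

        // Decode with the same charset that was requested above.
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // "Privet" in Cyrillic, written with Unicode escapes so the source
        // file compiles regardless of the compiler's source encoding.
        System.out.println(serialize("\u041f\u0440\u0438\u0432\u0435\u0442"));
    }
}
```

In an application, the handler would be passed as the ContentHandler to the parse call instead of being fed events by hand. The point is only that the encoding is named explicitly and is one the runtime supports.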
> java.io.UnsupportedEncodingException with Russian, Chinese, ... document
> ------------------------------------------------------------------------
>
> Key: TIKA-517
> URL: https://issues.apache.org/jira/browse/TIKA-517
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.7
> Environment: Mac OS X, Java 6, Eclipse
> Reporter: Dominique Béjean
> Assignee: Ken Krugler
>
> When I try to extract text from a PDF or DOC document in Russian, Chinese,
> Korean, Serbian, ..., I get an error about an unsupported encoding:
> org.xml.sax.SAXException: java.io.UnsupportedEncodingException:
>   at org.apache.xml.serialize.BaseMarkupSerializer.startDocument(Unknown Source)
>   at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
>   at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
>   at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
>   at org.apache.tika.sax.XHTMLContentHandler.startDocument(XHTMLContentHandler.java:93)
> ...
> It works fine with English and other ISO-8859-1 languages.
> PDFBox extracts the text correctly, so I assume the problem is not in the
> libraries used for text extraction from the various formats, but in a later step.