[jira] Commented: (TIKA-517) java.io.UnsupportedEncodingException with Russian, Chinese, ... document

Jukka Zitting (JIRA) Sun, 31 Oct 2010 16:53:48 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12926793#action_12926793
 ]


Jukka Zitting commented on TIKA-517:
------------------------------------

The stack trace suggests that the exception is thrown while you are 
serializing the output from Tika, so, as Ken said, this doesn't seem to be a 
Tika issue. How do you specify the output encoding?
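
For reference, here is a minimal sketch of serializing SAX events with an 
explicitly named output encoding. It is an assumption about what a working 
setup might look like, not the reporter's code: the stack trace points at 
org.apache.xml.serialize.BaseMarkupSerializer, whereas this sketch swaps in 
the JDK's own TransformerHandler so that it runs without extra libraries. The 
class name Utf8SaxSerializer and the single <p> element are hypothetical; in 
real use a parser would emit the SAX events into the handler.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper: shows one way to name the output encoding explicitly.
public class Utf8SaxSerializer {

    /** Serializes a one-paragraph document through a SAX handler set to UTF-8. */
    public static String serialize(String text) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();

            // Name the encoding explicitly instead of relying on a default.
            handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
            handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");

            // Write to a byte stream so the serializer controls the encoding.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            handler.setResult(new StreamResult(out));

            // Stand-in for the SAX events a parser would normally produce.
            handler.startDocument();
            handler.startElement("", "p", "p", new AttributesImpl());
            char[] chars = text.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endElement("", "p", "p");
            handler.endDocument();

            // Decode with the same charset that was requested above.
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (TransformerConfigurationException | SAXException e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    public static void main(String[] args) {
        // Unicode escapes for a Russian word and two Chinese characters,
        // so the source file itself stays ASCII.
        System.out.println(serialize("\u041f\u0440\u0438\u0432\u0435\u0442, \u4e16\u754c"));
    }
}
```

If the encoding name passed to a serializer is empty or not recognized by the 
JVM, an UnsupportedEncodingException at startDocument() is a plausible result, 
which is why the question above matters.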

> java.io.UnsupportedEncodingException with Russian, Chinese, ... document
> ------------------------------------------------------------------------
>
>                 Key: TIKA-517
>                 URL: https://issues.apache.org/jira/browse/TIKA-517
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>         Environment: Mac OS X, Java 6, Eclipse
>            Reporter: Dominique Béjean
>            Assignee: Ken Krugler
>
> When I try to extract text from a PDF or DOC document in Russian, Chinese, 
> Korean, Serbian, ..., I get an error concerning an unsupported encoding.
> org.xml.sax.SAXException: java.io.UnsupportedEncodingException: 
>       at org.apache.xml.serialize.BaseMarkupSerializer.startDocument(Unknown 
> Source)
>       at 
> org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
>       at 
> org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
>       at 
> org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
>       at 
> org.apache.tika.sax.XHTMLContentHandler.startDocument(XHTMLContentHandler.java:93)
>       ...
> It works fine with English or ISO-8859-1 languages.
> PDFBox extracts the text correctly, so I assume the problem is not in the 
> libraries used for text extraction from the various formats, but after.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
