java.io.UnsupportedEncodingException with Russian, Chinese, ... document
------------------------------------------------------------------------

                 Key: TIKA-517
                 URL: https://issues.apache.org/jira/browse/TIKA-517
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.7
         Environment: Mac OS X, Java 6, Eclipse
            Reporter: Dominique Béjean


When I try to extract text from PDF or DOC documents in Russian, Chinese, 
Korean, Serbian, ..., I get an error about an unsupported encoding.

org.xml.sax.SAXException: java.io.UnsupportedEncodingException: 
        at org.apache.xml.serialize.BaseMarkupSerializer.startDocument(Unknown Source)
        at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
        at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
        at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
        at org.apache.tika.sax.XHTMLContentHandler.startDocument(XHTMLContentHandler.java:93)
        ...

It works fine with English and other ISO-8859-1 languages.

PDFBox extracts the text correctly, so I assume the problem is not in the 
libraries used to extract text from the various formats, but in a later step.
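
One way to test that assumption is to feed non-Latin text straight into a SAX 
serializer, with no parser involved. The top frame of the trace is the 
serializer's startDocument, so if a plain serializer handles the same 
characters, the output handler setup becomes the main suspect. The sketch 
below is only an illustration under assumptions: it uses the JDK's built-in 
TransformerHandler with an explicit UTF-8 output encoding, not the 
org.apache.xml.serialize classes from the trace, and the class name 
SerializerCheck and the method serialize are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper, not part of Tika: serializes one <p> element through
// a SAX ContentHandler so the serializer can be tested without any parser.
class SerializerCheck {

    static String serialize(String text) throws Exception {
        // The JDK's default TransformerFactory also implements SAXTransformerFactory.
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();

        // Assumption: naming the output encoding explicitly, so the serializer
        // never has to resolve an empty or unknown encoding name.
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");

        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        // Emit the same kind of SAX events a parser would send to the handler.
        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        char[] chars = text.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "p", "p");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Russian and Chinese sample text, written as Unicode escapes so the
        // source file itself stays ASCII.
        System.out.println(serialize("\u041f\u0440\u0438\u0432\u0435\u0442, \u4e16\u754c"));
    }
}
```

If this runs cleanly, passing the same TransformerHandler to the Tika parser 
in place of the failing serializer would be a reasonable next experiment.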



