[ 
https://issues.apache.org/jira/browse/TIKA-517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912617#action_12912617
 ] 

Ken Krugler commented on TIKA-517:
----------------------------------

Hi Dominique,

I'm not sure there's anything Tika can do here. The issue is in the Xerces 
BaseMarkupSerializer.startDocument() method, where it appears to be making a 
call to Java's Charset class (either directly, or indirectly) and the provided 
charset name isn't supported.

This can happen when the platform doesn't support the charset, or when you've 
got an invalid charset name from somewhere.

We'd actually coded up our own "safeCharset" method in Tika, which is used when 
processing HTML documents.
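
As a rough illustration of that kind of defensive lookup, here is a minimal 
sketch. It is not Tika's actual "safeCharset" code; the class name, method name 
and fallback parameter are hypothetical. It only relies on the standard 
java.nio.charset.Charset calls, and covers both a missing or malformed name and 
a name the platform doesn't support.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper for illustration only, not Tika's implementation.
class SafeCharsetLookup {

    // Returns the named charset, or the fallback when the name is missing,
    // syntactically illegal, or not supported by this JVM.
    static Charset lookup(String name, Charset fallback) {
        if (name == null || name.trim().isEmpty()) {
            return fallback;
        }
        String trimmed = name.trim();
        try {
            // isSupported() throws for syntactically illegal names,
            // and returns false for legal names the JVM doesn't know.
            return Charset.isSupported(trimmed) ? Charset.forName(trimmed) : fallback;
        } catch (IllegalCharsetNameException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        String[] names = {"ISO-8859-1", "no-such-charset", "bad name!", null};
        for (String n : names) {
            System.out.println(n + " -> " + lookup(n, utf8));
        }
    }
}
```

Falling back silently can hide a bad charset name, so a real caller would 
probably also want to log the rejected name.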

Is there any way you can extract the actual charset name that's triggering this 
exception?
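
One possible way to get at it, sketched under assumptions: catch the 
SAXException, walk down to the wrapped UnsupportedEncodingException and look at 
its message. The class and method names below are hypothetical, and whether the 
message actually holds the charset name depends on the code that threw it (in 
the trace quoted below, the text after the exception class is blank, so this 
may well return an empty or null string).

```java
import java.io.UnsupportedEncodingException;
import org.xml.sax.SAXException;

// Hypothetical diagnostic helper for illustration only.
class EncodingDiagnostics {

    // Walks the chain of wrapped exceptions and returns the message of the
    // first UnsupportedEncodingException found, or null if there is none.
    static String findEncodingName(Throwable t) {
        int depth = 0;
        while (t != null && depth++ < 16) { // depth cap guards against cycles
            if (t instanceof UnsupportedEncodingException) {
                return t.getMessage();
            }
            Throwable next = null;
            if (t instanceof SAXException) {
                // SAXException keeps its wrapped exception here.
                next = ((SAXException) t).getException();
            }
            if (next == null) {
                next = t.getCause();
            }
            t = next;
        }
        return null;
    }

    public static void main(String[] args) {
        // Simulated failure; "x-fake-charset" is a made-up name.
        Throwable wrapped =
            new SAXException(new UnsupportedEncodingException("x-fake-charset"));
        System.out.println("charset name: [" + findEncodingName(wrapped) + "]");
    }
}
```

In real use, the call would go in the catch block around the Tika parse call 
that currently fails.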

Thanks,

-- Ken

> java.io.UnsupportedEncodingException with Russian, Chinese, ... document
> ------------------------------------------------------------------------
>
>                 Key: TIKA-517
>                 URL: https://issues.apache.org/jira/browse/TIKA-517
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>         Environment: Macosx, Java 6, Eclipse
>            Reporter: Dominique Béjean
>            Assignee: Ken Krugler
>
> When I try to extract text from a PDF or DOC document in Russian, Chinese, 
> Korean, Serbian, ..., I get an error about an unsupported encoding.
> org.xml.sax.SAXException: java.io.UnsupportedEncodingException: 
>       at org.apache.xml.serialize.BaseMarkupSerializer.startDocument(Unknown Source)
>       at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
>       at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
>       at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
>       at org.apache.tika.sax.XHTMLContentHandler.startDocument(XHTMLContentHandler.java:93)
>       ...
> It works fine with English or ISO-8859-1 languages.
> PDFBox extracts the text correctly, so I assume the problem is not in the 
> libraries used for text extraction from the various formats, but in a later step.

