[
https://issues.apache.org/jira/browse/TIKA-517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912617#action_12912617
]
Ken Krugler commented on TIKA-517:
----------------------------------
Hi Dominique,
I'm not sure there's anything Tika can do here. The issue is in the Xerces
BaseMarkupSerializer.startDocument() method, where it appears to be making a
call to Java's Charset class (either directly, or indirectly) and the provided
charset name isn't supported.
This can happen when the platform doesn't support the charset, or when you've
picked up an invalid charset name from somewhere.
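As a rough way to tell those two cases apart, something like the sketch below could be run against a suspect name. The class and method names (CharsetCheck, classify) are made up for illustration and are not part of Tika or Xerces; only the java.nio.charset calls are real JDK API.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper, not part of Tika or Xerces.
final class CharsetCheck {

    // Classifies a charset name as "missing", "illegal", "unsupported" or "supported".
    static String classify(String name) {
        if (name == null || name.isEmpty()) {
            // Charset.isSupported(null) would throw IllegalArgumentException.
            return "missing";
        }
        try {
            // Returns false when the name is syntactically legal but the
            // platform has no such charset.
            return Charset.isSupported(name) ? "supported" : "unsupported";
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not allowed in charset names.
            return "illegal";
        }
    }

    public static void main(String[] args) {
        String[] names = {"UTF-8", "", "bad name!", "x-no-such-charset"};
        for (String name : names) {
            System.out.println("'" + name + "' -> " + classify(name));
        }
    }
}
```

A "missing" or "illegal" result points at bad metadata coming out of the document, while "unsupported" points at the platform.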
We'd actually coded up our own "safeCharset" method in Tika, that's used when
processing HTML documents.
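To give an idea of what a defensive lookup of that kind can look like, here is a minimal sketch. It is not the actual Tika code; the class name SafeCharsetLookup, the method forNameOrDefault and the fallback parameter are assumptions made for this example.

```java
import java.nio.charset.Charset;

// Hypothetical sketch of a defensive charset lookup; not Tika's implementation.
final class SafeCharsetLookup {

    // Returns the named charset, or the given fallback when the name is
    // null, blank, syntactically illegal, or unsupported on this platform.
    static Charset forNameOrDefault(String name, Charset fallback) {
        if (name == null) {
            return fallback;
        }
        String trimmed = name.trim();
        if (trimmed.isEmpty()) {
            return fallback;
        }
        try {
            return Charset.forName(trimmed);
        } catch (IllegalArgumentException e) {
            // Covers both IllegalCharsetNameException and
            // UnsupportedCharsetException, which extend IllegalArgumentException.
            return fallback;
        }
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        System.out.println(forNameOrDefault(" utf-8 ", utf8));
        System.out.println(forNameOrDefault("bad name!", utf8));
    }
}
```

Falling back silently hides the bad name, so logging the rejected value before returning the fallback would make a report like this one easier to diagnose.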
Is there any way you can extract the actual charset name that's triggering this
exception?
Thanks,
-- Ken
> java.io.UnsupportedEncodingException with Russian, Chinese, ... document
> ------------------------------------------------------------------------
>
> Key: TIKA-517
> URL: https://issues.apache.org/jira/browse/TIKA-517
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.7
> Environment: Macosx, Java 6, Eclipse
> Reporter: Dominique Béjean
> Assignee: Ken Krugler
>
> When I try to extract text from a PDF or DOC document in Russian, Chinese,
> Korean, Serbian, ..., I get an error about an unsupported encoding.
> org.xml.sax.SAXException: java.io.UnsupportedEncodingException:
> at org.apache.xml.serialize.BaseMarkupSerializer.startDocument(Unknown Source)
> at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
> at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
> at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
> at org.apache.tika.sax.XHTMLContentHandler.startDocument(XHTMLContentHandler.java:93)
> ...
> It works fine with English or other iso-8859-1 languages.
> PDFBox extracts the text correctly, so I assume the problem is not in the
> libraries used for text extraction from the various formats, but somewhere
> after that step.