[ 
https://issues.apache.org/jira/browse/TIKA-517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927367#action_12927367
 ] 

Dominique Béjean commented on TIKA-517:
---------------------------------------

Hi,

Thank you for these replies.

While preparing a sample of my code, I ran some tests, and I can no longer 
reproduce the issue.

My code looks like this:

    input = new FileInputStream("russian.pdf");
    contentType = "application/pdf";
    outputEncoding = "UTF-8";

    // Register the parser in the context so it is also used for embedded documents.
    ParseContext context = new ParseContext();
    Parser parser = new AutoDetectParser();
    context.set(Parser.class, parser);

    Metadata metadata = new Metadata();
    metadata.add("stream_content_type", contentType);

    // Serialize the SAX events as plain text into an in-memory writer.
    StringWriter writer = new StringWriter();
    BaseMarkupSerializer serializer = new TextSerializer();
    serializer.setOutputCharStream(writer);
    serializer.setOutputFormat(new OutputFormat("text", outputEncoding, true));

    parser.parse(input, serializer, metadata, context);
    writer.close();

    content = writer.toString();
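
Since the reported error is an UnsupportedEncodingException, one quick sanity check is to confirm that the encoding name handed to OutputFormat is one the JVM can resolve. The following is only a sketch under stated assumptions: the class EncodingCheck and its methods are hypothetical names, not part of Tika or Xerces, and it checks the JVM's own charset table, which the Xerces serializer may or may not consult.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper, not part of Tika or Xerces.
public class EncodingCheck {

    /** Returns true if this JVM can resolve the given encoding name. */
    public static boolean isUsable(String name) {
        // A null or empty name can never be resolved.
        if (name == null || name.isEmpty()) {
            return false;
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not legal in a charset name.
            return false;
        }
    }

    /** Returns true if a standard writer rejects the name with UnsupportedEncodingException. */
    public static boolean writerRejects(String name) {
        try {
            new OutputStreamWriter(new ByteArrayOutputStream(), name).close();
            return false;
        } catch (UnsupportedEncodingException e) {
            return true;
        } catch (IOException e) {
            // Closing an in-memory stream is not expected to fail.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String outputEncoding = args.length > 0 ? args[0] : "UTF-8";
        System.out.println(outputEncoding + " usable: " + isUsable(outputEncoding));
    }
}
```

Calling EncodingCheck.isUsable(outputEncoding) before building the OutputFormat would at least rule out an empty or misspelled encoding name as the cause.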

If I reproduce the problem later, I will provide details.

Dominique

> java.io.UnsupportedEncodingException with Russian, Chinese, ... document
> ------------------------------------------------------------------------
>
>                 Key: TIKA-517
>                 URL: https://issues.apache.org/jira/browse/TIKA-517
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>         Environment: Macosx, Java 6, Eclipse
>            Reporter: Dominique Béjean
>            Assignee: Ken Krugler
>
> When I try to extract text from a PDF or DOC document in Russian, Chinese, 
> Korean, Serbian, ..., I get an error about an unsupported encoding:
> org.xml.sax.SAXException: java.io.UnsupportedEncodingException:
>       at org.apache.xml.serialize.BaseMarkupSerializer.startDocument(Unknown Source)
>       at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
>       at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
>       at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
>       at org.apache.tika.sax.XHTMLContentHandler.startDocument(XHTMLContentHandler.java:93)
>       ...
> It works fine with English and other ISO-8859-1 languages.
> PDFBox extracts the text correctly, so I assume the problem is not in the 
> libraries used for text extraction from the various formats, but in a later step.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.