[
https://issues.apache.org/jira/browse/PDFBOX-3281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205917#comment-15205917
]
Tilman Hausherr commented on PDFBOX-3281:
-----------------------------------------
PDFText2HTML escapes everything, but ExtractText writes the output in the
chosen encoding, i.e. using "-encoding utf16" really creates a UTF-16 file,
with one 16-bit word per character. The problem is that writeHeader() does not
know the output encoding, but it should. Alternatively, we could simply forbid
the encoding parameter, or state that it is ignored.
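
To illustrate the first option, here is a minimal sketch of a header writer
that is told the output charset instead of hard-coding one. The class
HtmlHeaderSketch and the Charset parameter on writeHeader() are hypothetical
and only for illustration; this is not the actual PDFText2HTML code or
signature.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class HtmlHeaderSketch {

    // Hypothetical variant of writeHeader(): the caller passes in the charset
    // that the output writer really uses, so the meta tag cannot disagree
    // with the bytes that end up in the file.
    static String writeHeader(Charset outputCharset) {
        StringBuilder buf = new StringBuilder();
        buf.append("<html><head>\n");
        buf.append("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=")
           .append(outputCharset.name())
           .append("\">\n");
        buf.append("</head>\n");
        return buf.toString();
    }

    public static void main(String[] args) {
        // Example: a caller that writes the file as UTF-8 announces UTF-8.
        System.out.print(writeHeader(StandardCharsets.UTF_8));
    }
}
```

With this shape, the code that opens the output file would pass the same
Charset to both the writer and the header, so the two cannot drift apart.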
> HTML output wrongly specifies UTF-16 in header
> ----------------------------------------------
>
> Key: PDFBOX-3281
> URL: https://issues.apache.org/jira/browse/PDFBOX-3281
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Environment: OS X 10.11.4, Java 1.8.0_73-b02
> Reporter: Aaron Madlon-Kay
> Attachments: testdoc.html, testdoc.pdf
>
>
> When running the command line {{ExtractText}} with the {{-html}} flag, the
> output file always has the following meta tag specifying UTF-16 regardless of
> the actual output encoding:
> {code:html}
> <meta http-equiv="Content-Type" content="text/html; charset="UTF-16">
> {code}
> This causes editors that respect the meta tag (emacs, etc.) to garble the
> file content.