[jira] [Commented] (PDFBOX-3281) HTML output wrongly specifies UTF-16 in header

Tilman Hausherr (JIRA) Tue, 22 Mar 2016 00:02:24 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205917#comment-15205917
 ]


Tilman Hausherr commented on PDFBOX-3281:
-----------------------------------------

PDFText2HTML escapes everything, but ExtractText writes the output in the chosen 
encoding, i.e. using "-encoding utf16" really creates a UTF-16 file, with one 
16-bit word per character. The problem is that writeHeader() does not know its 
own encoding, but it should. Alternatively, we could just forbid the encoding 
parameter, or state that it is ignored.
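
A minimal sketch of the first option, where the header is built from the same 
charset that is used to encode the output. This is not PDFBox code: the class 
HtmlHeaderSketch and its methods metaTag() and render() are hypothetical names 
used only for illustration, and how the real writeHeader() would receive the 
encoding is left open.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not PDFBox code: the header is derived from the same
// Charset that encodes the output, so the two cannot disagree.
class HtmlHeaderSketch {

    // Builds the meta tag from the charset actually used for writing.
    static String metaTag(Charset charset) {
        return "<meta http-equiv=\"Content-Type\" content=\"text/html; charset="
                + charset.name() + "\">";
    }

    // Encodes a minimal HTML document with the given charset.
    static byte[] render(String body, Charset charset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(bytes, charset)) {
            out.write("<html><head>" + metaTag(charset) + "</head><body>");
            out.write(body);
            out.write("</body></html>");
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // The declared charset follows the charset used for encoding.
        byte[] utf8 = render("text", StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}
```

With this shape, choosing UTF-8 or UTF-16 for the output changes both the bytes 
written and the charset declared in the meta tag.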

> HTML output wrongly specifies UTF-16 in header
> ----------------------------------------------
>
>                 Key: PDFBOX-3281
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3281
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>         Environment: OS X 10.11.4, Java 1.8.0_73-b02
>            Reporter: Aaron Madlon-Kay
>         Attachments: testdoc.html, testdoc.pdf
>
>
> When running the command line {{ExtractText}} with the {{-html}} flag, the 
> output file always has the following meta tag specifying UTF-16 regardless of 
> the actual output encoding:
> {code:html}
> <meta http-equiv="Content-Type" content="text/html; charset="UTF-16">
> {code}
> This causes editors that respect the meta tag (emacs, etc.) to garble the 
> file content.



