[ https://issues.apache.org/jira/browse/PDFBOX-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16105104#comment-16105104 ]
ASF subversion and git services commented on PDFBOX-3881:
---------------------------------------------------------

Commit 1803283 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1803283 ]

PDFBOX-3881: don't keep BOM for empty strings, as suggested by Nico Prenzel

> Handling of Byte Order Mark with Metadata-Fields
> ------------------------------------------------
>
>                 Key: PDFBOX-3881
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3881
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.7
>         Environment: Windows
>            Reporter: Nico Prenzel
>            Assignee: Tilman Hausherr
>            Priority: Minor
>         Attachments: ERiCDruck_23776162_ESt_0_20170727_121644-pdfcreator.pdf
>
>
> PDDocumentInformation, e.g. getAuthor(), honors the byte order of the
> extracted string and removes the byte order mark.
> But if the extracted string contains only the byte order mark, the
> corresponding string "þÿ" is returned.
> Is this the intended behavior?
> I'd appreciate it if the byte order mark were also removed when the
> extracted string contains nothing else.
> Problematic code:
> {code:java}
> public String getString()
> {
>     if (this.bytes.length > 2)
>     {
>         if (((this.bytes[0] & 0xFF) == 254) && ((this.bytes[1] & 0xFF) == 255))
>         {
>             return new String(this.bytes, 2, this.bytes.length - 2,
>                     Charsets.UTF_16BE);
>         }
>         if (((this.bytes[0] & 0xFF) == 255) && ((this.bytes[1] & 0xFF) == 254))
>         {
>             return new String(this.bytes, 2, this.bytes.length - 2,
>                     Charsets.UTF_16LE);
>         }
>     }
>
>     return PDFDocEncoding.toString(this.bytes);
> }
> {code}
> The attachment is an example PDF.
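
For readers following along: the quoted method only checks for a BOM when the array is longer than two bytes, so a string consisting of exactly the two BOM bytes falls through to the PDFDocEncoding path and comes back as "þÿ". The sketch below shows one way to get the behavior requested in the issue by relaxing the length check to `>= 2`. It is not the content of commit r1803283, which is not reproduced in this message. The class name `BomDecodeSketch` and the method `decode` are hypothetical, and ISO-8859-1 is used only as a stand-in for `PDFDocEncoding.toString`, which it approximates for these two bytes but does not replace in general.

```java
import java.nio.charset.StandardCharsets;

public class BomDecodeSketch
{
    // Hypothetical helper, not part of the PDFBox API.
    // Decodes a PDF text string and drops a leading UTF-16 BOM,
    // including the case where the BOM is the only content.
    static String decode(byte[] bytes)
    {
        // ">= 2" instead of "> 2": a BOM-only array is handled here too
        if (bytes.length >= 2)
        {
            int b0 = bytes[0] & 0xFF;
            int b1 = bytes[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF)
            {
                // Big-endian BOM; a remaining length of 0 yields ""
                return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
            }
            if (b0 == 0xFF && b1 == 0xFE)
            {
                // Little-endian BOM
                return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16LE);
            }
        }
        // Assumption: ISO-8859-1 stands in for PDFDocEncoding.toString(bytes)
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args)
    {
        byte[] bomOnly = { (byte) 0xFE, (byte) 0xFF };
        byte[] bomAndA = { (byte) 0xFE, (byte) 0xFF, 0x00, 0x41 };
        System.out.println(decode(bomOnly).isEmpty()); // prints "true"
        System.out.println(decode(bomAndA));           // prints "A"
    }
}
```

With the original `> 2` check, the first call would instead reach the fallback and return the two-character string from the report.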