Nico Prenzel created PDFBOX-3881:
------------------------------------
Summary: Handling of Byte Order Mark with Metadata-Fields
Key: PDFBOX-3881
URL: https://issues.apache.org/jira/browse/PDFBOX-3881
Project: PDFBox
Issue Type: Bug
Components: Parsing
Affects Versions: 2.0.7
Environment: Windows
Reporter: Nico Prenzel
Priority: Minor
Attachments: ERiCDruck_23776162_ESt_0_20170727_121644-pdfcreator.pdf
PDDocumentInformation accessors such as getAuthor() honor the byte order of the
extracted string and strip the byte order mark (BOM).
However, if the extracted string consists only of the BOM, the corresponding
string "þÿ" is returned.
Is this the intended behavior?
I would appreciate it if the BOM were also removed when the extracted string
contains nothing else.
public String getString()
{
    {color:red}if (this.bytes.length > 2){color}
    {
        if (((this.bytes[0] & 0xFF) == 254) && ((this.bytes[1] & 0xFF) == 255))
        {
            return new String(this.bytes, 2, this.bytes.length - 2,
                    Charsets.UTF_16BE);
        }
        if (((this.bytes[0] & 0xFF) == 255) && ((this.bytes[1] & 0xFF) == 254))
        {
            return new String(this.bytes, 2, this.bytes.length - 2,
                    Charsets.UTF_16LE);
        }
    }
    return PDFDocEncoding.toString(this.bytes);
}
The attachment contains an example PDF.
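To make the difference concrete, here is a minimal standalone sketch of the behavior described above next to one possible variant that also strips a bare BOM. It is not PDFBox code: the class `BomSketch` and its method names are hypothetical, and ISO-8859-1 is used as a stand-in for `PDFDocEncoding.toString`, on the assumption that both map the bytes 0xFE 0xFF to "þÿ". The only change in the variant is `>= 2` instead of `> 2`.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of PDFBox.
final class BomSketch {

    // Mirrors the quoted logic: the BOM is only checked when there are
    // more than two bytes, so a string consisting of just the BOM falls through.
    static String decodeCurrent(byte[] bytes) {
        return decode(bytes, 3);
    }

    // Possible variant: also check the BOM when exactly two bytes are present.
    static String decodeProposed(byte[] bytes) {
        return decode(bytes, 2);
    }

    private static String decode(byte[] bytes, int minLengthForBomCheck) {
        if (bytes.length >= minLengthForBomCheck) {
            int b0 = bytes[0] & 0xFF;
            int b1 = bytes[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                // UTF-16 big-endian BOM: skip the two marker bytes.
                return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                // UTF-16 little-endian BOM: skip the two marker bytes.
                return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16LE);
            }
        }
        // Stand-in for PDFDocEncoding.toString(bytes) (assumption, see lead-in).
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] bomOnly = {(byte) 0xFE, (byte) 0xFF};
        System.out.println("current:  [" + decodeCurrent(bomOnly) + "]");
        System.out.println("proposed: [" + decodeProposed(bomOnly) + "]");
    }
}
```

With a BOM-only input, `decodeCurrent` yields the two characters "þÿ" while `decodeProposed` yields an empty string; inputs that carry text after the BOM decode identically in both.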