[jira] [Updated] (PDFBOX-3881) Handling of Byte Order Mark with Metadata-Fields

Nico Prenzel (JIRA) Thu, 27 Jul 2017 04:03:00 -0700

     [ 
https://issues.apache.org/jira/browse/PDFBOX-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Nico Prenzel updated PDFBOX-3881:
---------------------------------
    Description: 
PDDocumentInformation getters such as getAuthor() honor the byte order of the 
extracted string and strip the byte order mark (BOM).

However, if the extracted string consists only of the byte order mark, the 
corresponding string "þÿ" is returned.

Is this the intended behavior?
I would appreciate it if the byte order mark were also removed when the 
extracted string contains nothing else.


Problematic code:
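{code:java}
public String getString()
{
    // A BOM-only string has length == 2, so "> 2" skips the BOM handling for it.
    if (this.bytes.length > 2)
    {
        // 254, 255 = 0xFE 0xFF: UTF-16 big-endian BOM
        if (((this.bytes[0] & 0xFF) == 254) && ((this.bytes[1] & 0xFF) == 255))
        {
            return new String(this.bytes, 2, this.bytes.length - 2, Charsets.UTF_16BE);
        }
        // 255, 254 = 0xFF 0xFE: UTF-16 little-endian BOM
        if (((this.bytes[0] & 0xFF) == 255) && ((this.bytes[1] & 0xFF) == 254))
        {
            return new String(this.bytes, 2, this.bytes.length - 2, Charsets.UTF_16LE);
        }
    }

    // No BOM recognized: the bytes (including a lone BOM) are decoded as PDFDocEncoding.
    return PDFDocEncoding.toString(this.bytes);
}
{code}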
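
One possible way to handle the BOM-only case, as a minimal sketch under stated 
assumptions: the class and method names (BomAwareDecoder, decode) are 
hypothetical and not part of PDFBox, java.nio.charset.StandardCharsets replaces 
the Charsets helper, and ISO-8859-1 stands in for PDFDocEncoding.toString() so 
that the snippet is self-contained. The only behavioral change is the length 
check (>= 2 instead of > 2).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the getString() logic quoted above; not PDFBox code.
final class BomAwareDecoder
{
    static String decode(byte[] bytes)
    {
        // ">= 2" instead of "> 2": a string made up of only the BOM is recognized too.
        if (bytes.length >= 2)
        {
            // 0xFE 0xFF: UTF-16 big-endian BOM
            if ((bytes[0] & 0xFF) == 0xFE && (bytes[1] & 0xFF) == 0xFF)
            {
                return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
            }
            // 0xFF 0xFE: UTF-16 little-endian BOM
            if ((bytes[0] & 0xFF) == 0xFF && (bytes[1] & 0xFF) == 0xFE)
            {
                return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16LE);
            }
        }
        // Assumption: ISO-8859-1 is used here only as a placeholder for
        // PDFDocEncoding.toString(bytes), which is not available outside PDFBox.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args)
    {
        byte[] bomOnly = { (byte) 0xFE, (byte) 0xFF };
        // Print the decoded length for a BOM-only input.
        System.out.println(decode(bomOnly).length());
    }
}
```

With this check, the two bytes FE FF decode to an empty string rather than 
"þÿ". Whether an empty string or null is the better result for such a field is 
a separate design question that this sketch does not settle.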


An example PDF is attached.



> Handling of Byte Order Mark with Metadata-Fields
> ------------------------------------------------
>
>                 Key: PDFBOX-3881
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3881
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.7
>         Environment: Windows
>            Reporter: Nico Prenzel
>            Priority: Minor
>         Attachments: ERiCDruck_23776162_ESt_0_20170727_121644-pdfcreator.pdf


