[
https://issues.apache.org/jira/browse/PDFBOX-2009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959003#comment-13959003
]
John Hewson commented on PDFBOX-2009:
-------------------------------------
{quote}
Am I correct that if a BOM was determined, the codelength should be set to 2
(and not be changed)?
{quote}
Yes, here's what the PDF32000 spec has to say on the issue:
{quote}
For text strings encoded in Unicode, the first two bytes shall be 254 followed
by 255. These two bytes represent the Unicode byte order marker, U+FEFF,
indicating that the string is encoded in the UTF-16BE (big-endian) encoding
scheme specified in the Unicode standard.
{quote}
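To make the quoted rule concrete, here is a minimal sketch in plain Java. It is not PDFBox code: `BomTextDecoder`, `decode` and `fromHex` are hypothetical names made up for illustration, and the ISO-8859-1 fallback is only a stand-in for whatever encoding applies when no BOM is present. The sketch assumes that once the FE FF marker is seen, it is skipped and every following code is two bytes, big-endian.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of PDFBox.
final class BomTextDecoder {

    // Decodes a string's raw bytes, honouring a leading FE FF byte order marker.
    static String decode(byte[] bytes) {
        boolean hasBom = bytes.length >= 2
                && (bytes[0] & 0xFF) == 0xFE   // 254
                && (bytes[1] & 0xFF) == 0xFF;  // 255
        if (hasBom) {
            // Skip the two BOM bytes; the rest is UTF-16BE, two bytes per code unit.
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        // Assumption: ISO-8859-1 stands in for the non-Unicode case here.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Converts a hex string such as "FEFF2122" into bytes (even length assumed).
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        // The operand from the issue description.
        String text = decode(fromHex("FEFF21222193219103B103A003A6"));
        for (char c : text.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

For the operand from the issue description, this yields six characters: U+2122, U+2193, U+2191, U+03B1, U+03A0 and U+03A6.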
> PDFStreamEngine.processEncodedText incorrectly handling UTF-16 text with BOM
> FEFF
> ---------------------------------------------------------------------------------
>
> Key: PDFBOX-2009
> URL: https://issues.apache.org/jira/browse/PDFBOX-2009
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Philip Helger
> Fix For: 2.0.0
>
>
> Given a text print operation like
> <FEFF21222193219103B103A003A6> Tj
> PDFStreamEngine.processEncodedText does not handle it correctly.
> Am I correct that if a BOM was determined, the codelength should be set to 2
> (and not be changed)? Or, alternatively, should the BOM simply be skipped?
> It may be related to PDFBOX-920