[
https://issues.apache.org/jira/browse/PDFBOX-920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated PDFBOX-920:
--------------------------------
Attachment: nullcharactername.patch
A patch which fixes the problem for me.
> PDFStreamEngine.processEncodedText fails on UTF-16 text
> -------------------------------------------------------
>
> Key: PDFBOX-920
> URL: https://issues.apache.org/jira/browse/PDFBOX-920
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 1.3.1
> Reporter: Antoni Mylka
> Attachments: nullcharactername.patch
>
>
> I have a PDF document which yields gibberish text. Debugging it leads me to
> PDFStreamEngine.processEncodedText. The method receives the following byte
> array:
> [0, 47, 0, 82, 0, 82, 0, 78, 0, 3, 0, 68, 0, 87, 0, 3, 0, 87, 0, 75, 0, 72,
> 0, 3, 0, -64, 0, 85, 0, 86, 0, 87, 0, 3, 0, 83, 0, 76, 0, 70, 0, 87, 0, 88,
> 0, 85, 0, 72, 0, 3, 0, 68, 0, 69, 0, 82, 0, 89, 0, 72, 0, 17, 0, 3]
> This looks to me like some UTF-16 text, but the codes differ from what
> you'd normally expect. I don't understand the encoding. In 1.2.1 this yielded
> the correct output though ("Look at the picture above"). In 1.3.1 and the
> current trunk this is converted to garbage. The culprit is here:
>     codeLength = 1;
>     String c = font.encode( string, i, codeLength );
>     if( c == null && i+1<string.length)
>     {
>         //maybe a multibyte encoding
>         codeLength++;
>         c = font.encode( string, i, codeLength );
>     }
> So the code first tries to 'encode' a single byte as a character, and then
> tries two bytes, three bytes etc. It starts with a 00 byte. In 1.2.1,
> PDFont.encode would return null for it. The program would then try with two
> bytes, getting a correct character on the second attempt.
> In the current trunk the font.encode method returns a space " " when 00 is
> passed. This is clearly wrong, because the lone 00 byte is then accepted as a
> character and the rest of the string is parsed incorrectly. I tried to debug
> further and it seems to me that the problem is in the Encoding class, in the
> getName method. It looks like this:
>     public String getName( int code ) throws IOException
>     {
>         String name = codeToName.get( code );
>         if( name == null )
>         {
>             //lets be forgiving for now
>             name = "space";
>         }
>         return name;
>     }
> The crucial bit is the "let's be forgiving for now". If a code is unknown in
> the encoding, the name "space" is returned. In my case this completely breaks
> the parsing of a file.
> What was the rationale behind this behavior? Removing it fixed my problem and
> didn't break anything. All unit tests of pdfbox pass. The regression tests of
> my applications (based on the PDF extraction code from the Aperture
> Framework) also pass. The "forgiving" part was added in PDFBOX-626, but
> the issue description doesn't give any reasons for it. If the "forgiveness"
> is there for a good reason, I'd be grateful for advice on how to deal with the
> problem. Otherwise please remove it.
> Unfortunately I can't share the problem file.
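Since the problem file can't be shared, the effect described above can be reproduced in isolation with a minimal sketch. It is not PDFBox code: the class CodeLengthSketch, the TWO_BYTE table and the "forgiving" flag are hypothetical stand-ins, and the code-to-character mapping is only inferred from the first bytes of the array and the output that 1.2.1 produced. The loop mirrors the quoted snippet: try one byte, and fall back to two bytes only when the one-byte lookup returns null.

```java
import java.util.HashMap;
import java.util.Map;

public class CodeLengthSketch {
    // Hypothetical two-byte code table, inferred from the sample bytes
    // and the expected text "Look at the ..." (an assumption, not PDFBox data).
    static final Map<Integer, String> TWO_BYTE = new HashMap<>();
    static {
        TWO_BYTE.put(47, "L");
        TWO_BYTE.put(82, "o");
        TWO_BYTE.put(78, "k");
        TWO_BYTE.put(3, " ");
    }

    // Stand-in for font.encode: this pretend font defines no one-byte codes.
    // With forgiving == true an unknown one-byte code becomes a space,
    // otherwise null is returned so the caller can retry with more bytes.
    static String encode(byte[] data, int offset, int length, boolean forgiving) {
        int code = 0;
        for (int k = 0; k < length; k++) {
            code = (code << 8) | (data[offset + k] & 0xFF);
        }
        if (length == 1) {
            return forgiving ? " " : null;
        }
        return TWO_BYTE.get(code);
    }

    // Same shape as the loop quoted from processEncodedText.
    static String decode(byte[] data, boolean forgiving) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < data.length) {
            int codeLength = 1;
            String c = encode(data, i, codeLength, forgiving);
            if (c == null && i + 1 < data.length) {
                // maybe a multibyte encoding
                codeLength++;
                c = encode(data, i, codeLength, forgiving);
            }
            if (c != null) {
                out.append(c);
            }
            i += codeLength;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] data = {0, 47, 0, 82, 0, 82, 0, 78};
        System.out.println("strict:    [" + decode(data, false) + "]");
        System.out.println("forgiving: [" + decode(data, true) + "]");
    }
}
```

With the strict lookup every 00 byte yields null, the two-byte retry fires, and the first eight bytes come out as "Look". With the forgiving lookup the retry never fires, so each byte is consumed on its own and the text is lost.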