[ 
https://issues.apache.org/jira/browse/PDFBOX-2951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14727884#comment-14727884
 ] 

John Hewson edited comment on PDFBOX-2951 at 9/2/15 7:37 PM:
-------------------------------------------------------------

Unrelated to the bug: Looking at your code, did you realise that you can 
override the specific operator methods of PDFStreamEngine rather than 
processOperator? That way you can take advantage of all of PDFBox's built-in 
operator handling and still add your own custom handling. 
PDFGraphicsStreamEngine provides an extended version with methods that 
you can override for each graphics operator too.

Also note that you can't measure the width of a string in a PDF like this for 
two reasons:
{code}
case Tj:
    String text = ((COSString) operands.get(0)).getString();
    try {
        currentFont.getStringWidth(text);
    } catch (Exception ex) {
        ex.printStackTrace();
        System.exit(1);
    }
    return;
{code}

1. Each font has its own encoding but .getString() only supports 
PDFDocEncoding. That's because getString() only works for a "text string" but 
you're dealing with a "content stream string". Only getBytes() is valid with 
these, and PDFont's decode() method is necessary to read those bytes.

2. Decoding and encoding are not symmetric in PDF, i.e. _encoded text 
=> Unicode => encoded text_ can produce output that differs from its input. 
That's because the font encoding and Unicode information for characters are 
entirely separate in PDF, and often broken. So if you want to measure 
characters in an existing PDF then you have to avoid any decoding and use the 
encoded character codes directly, via PDFont's getWidth(code) method.
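
To make the second point concrete, here is a minimal sketch of the asymmetry using plain Java maps rather than PDFBox classes. Everything in it is hypothetical: the class RoundTripDemo, the two-entry encoding, and the width values are made up for illustration, loosely modelled on a font whose encoding maps both code 34 and code 65 to "A".

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RoundTripDemo {
    // Hypothetical font encoding: character code -> Unicode.
    static final Map<Integer, String> TO_UNICODE = new LinkedHashMap<>();
    // Hypothetical glyph widths, keyed by character code (not by Unicode).
    static final Map<Integer, Float> WIDTHS = new LinkedHashMap<>();
    // Inverse mapping Unicode -> code; later entries overwrite earlier ones.
    static final Map<String, Integer> TO_CODE = new LinkedHashMap<>();

    static {
        TO_UNICODE.put(34, "A");   // code 34 remapped to "A"
        TO_UNICODE.put(65, "A");   // code 65 is also "A"
        WIDTHS.put(34, 600f);
        WIDTHS.put(65, 722f);
        for (Map.Entry<Integer, String> e : TO_UNICODE.entrySet()) {
            TO_CODE.put(e.getValue(), e.getKey());
        }
    }

    /** Decodes a code to Unicode and then encodes that Unicode again. */
    public static int roundTrip(int code) {
        return TO_CODE.get(TO_UNICODE.get(code));
    }

    /** Looks the width up directly from the encoded character code. */
    public static float widthOfCode(int code) {
        return WIDTHS.get(code);
    }

    public static void main(String[] args) {
        int original = 34;
        int recovered = roundTrip(original);
        System.out.println(original + " -> " + recovered);
        System.out.println(widthOfCode(original) + " vs " + widthOfCode(recovered));
    }
}
```

In this toy font the round trip turns code 34 into code 65, so a width computed after decoding belongs to a different glyph than the one in the content stream. Looking the width up by the original code avoids that.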

I'd recommend overriding the text and glyph methods in PDFStreamEngine, which 
already does all the hard work for you.


> quotedbl causes NullPointerException
> ------------------------------------
>
>                 Key: PDFBOX-2951
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2951
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>    Affects Versions: 2.0.0
>         Environment: Windows 10 64 bit
>            Reporter: Juergen Uhl
>         Attachments: Test.jar, Test.java, Test.pdf
>
>
> I have a pdf document using (besides others) the font CourierNewPS-BoldMT and 
> text with this font containing a double quote.
> When calling PDFont.encode, this results in a NullPointerException due to the 
> following:
> The font encoding is built using the PDF /Differences array, which overwrites the 
> original "quotedbl" at index 34 with an "A". The entries for 
> quotedblbase/left/right are left unchanged. As a result, the inverted encoding 
> does not contain "quotedbl" as a key.
> Within encode, the character code 34 gets assigned the name "quotedbl", which 
> is then not found in the inverse encoding (PDTrueTypeFont.encode -> int code 
> = inverted.get(name))
> Right before this code line causing the NullPointerException, there is a 
> check whether ttf.hasGlyph("quotedbl") (which in this case is false) and, if 
> not, whether ttf.hasGlyph("uni0022") (which in this case is true); however, 
> this has no consequence for the continuation of the code, which then crashes, 
> since inverted.get("quotedbl") is null (which is assigned to an int).
> I believe this is a bug in PDFBox, but I have no idea whether the handling 
> within encode should be changed (maybe using the "else" part in case 
> ttf.hasGlyph("quotedbl") is false), whether code 34 should be assigned to 
> quotedblbase in the first place, or something else entirely.
> I attached the file (Test.pdf) where the error occurs and a jar (main is 
> com.juergisApps.pdfConverter.Test) that reproduces the problem.
> You may also see 
> http://stackoverflow.com/questions/7140476/pdf-font-mapping-error
> Juergen
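
The crash described in the quoted report comes down to auto-unboxing a null Integer. Here is a minimal sketch in plain Java with no PDFBox classes involved. The class InvertedLookupDemo, its map contents, and the -1 sentinel are assumptions for illustration only, not PDFBox's actual code or fix.

```java
import java.util.HashMap;
import java.util.Map;

public class InvertedLookupDemo {
    // Hypothetical inverted encoding: glyph name -> character code.
    // "quotedbl" is absent because code 34 was overwritten with "A".
    static final Map<String, Integer> INVERTED = new HashMap<>();

    static {
        INVERTED.put("A", 34);
        INVERTED.put("quotedblleft", 147);
        INVERTED.put("quotedblright", 148);
    }

    /** Mirrors the failing pattern: unboxing a missing entry throws. */
    public static int unsafeCode(String name) {
        int code = INVERTED.get(name); // NullPointerException if name is absent
        return code;
    }

    /** Guarded variant: returns -1 (a made-up sentinel) if the name is absent. */
    public static int safeCode(String name) {
        Integer code = INVERTED.get(name);
        return code == null ? -1 : code;
    }

    public static void main(String[] args) {
        System.out.println(safeCode("quotedbl"));
        try {
            unsafeCode("quotedbl");
        } catch (NullPointerException e) {
            System.out.println("NullPointerException from unboxing null");
        }
    }
}
```

Whether a missing name should fall back to another glyph name, such as "uni0022", or be reported as an error is a design decision this sketch does not try to settle.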


