[ 
https://issues.apache.org/jira/browse/PDFBOX-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15367595#comment-15367595
 ] 

Michael Klink commented on PDFBOX-3403:
---------------------------------------

{quote}
Text extraction has to reassemble the lines of text and uses the widths of the 
characters to do this
{quote}

Are you sure?

As far as I can see, it looks up the widths not by character but by glyph 
code:

{code:title=org.apache.pdfbox.contentstream.PDFStreamEngine.showText(byte[])}
            // decode a character
            int before = in.available();
            int code = font.readCode(in);
            int codeLength = before - in.available();
            String unicode = font.toUnicode(code);
...
            // get glyph's horizontal and vertical displacements, in text space
            Vector w = font.getDisplacement(code);
...
            showGlyph(textRenderingMatrix, font, code, unicode, w);
{code}

and

{code:title=org.apache.pdfbox.text.PDFTextStreamEngine.showGlyph(Matrix, PDFont, int, String, Vector)}
    protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode,
                             Vector displacement) throws IOException
...
        float displacementX = displacement.getX();
        // the sorting algorithm is based on the width of the character. As the displacement
        // for vertical characters doesn't provide any suitable value for it, we have to
        // calculate our own
        if (font.isVertical())
        {
            displacementX = font.getWidth(code) / 1000;
...
        }
        // (modified) combined displacement, this is calculated *without* taking the character
        // spacing and word spacing into account, due to legacy code in TextStripper
        float tx = displacementX * fontSize * horizontalScaling;
        float ty = displacement.getY() * fontSize;
{code}
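
To make the distinction concrete, here is a minimal, self-contained sketch of a width lookup keyed by glyph code rather than by the decoded character. It does not use PDFBox itself: {{ToyFont}}, {{addGlyph}} and {{horizontalAdvance}} are hypothetical names introduced only for illustration, and the widths are made-up values. The arithmetic mirrors the {{font.getWidth(code) / 1000}} and {{displacementX * fontSize * horizontalScaling}} lines quoted above.

```java
import java.util.HashMap;
import java.util.Map;

public class GlyphWidthSketch {

    /** Hypothetical stand-in for a font; NOT the PDFBox PDFont class. */
    static final class ToyFont {
        // Both tables are keyed by glyph code, never by the Unicode text.
        private final Map<Integer, Float> widthsByCode = new HashMap<>();
        private final Map<Integer, String> unicodeByCode = new HashMap<>();

        void addGlyph(int code, float width, String unicode) {
            widthsByCode.put(code, width);
            unicodeByCode.put(code, unicode);
        }

        /** Width in 1/1000 text space units; 0 if the code is unknown. */
        float getWidth(int code) {
            return widthsByCode.getOrDefault(code, 0f);
        }

        /** Unicode text for the code, or null if there is no mapping. */
        String toUnicode(int code) {
            return unicodeByCode.get(code);
        }
    }

    /** Horizontal advance, following the formula in the quoted showGlyph excerpt. */
    static float horizontalAdvance(ToyFont font, int code, float fontSize,
                                   float horizontalScaling) {
        float displacementX = font.getWidth(code) / 1000f;
        return displacementX * fontSize * horizontalScaling;
    }

    public static void main(String[] args) {
        ToyFont font = new ToyFont();
        font.addGlyph(1, 500f, "A");   // assumed width
        font.addGlyph(2, 1000f, "A");  // same text, different glyph and width
        font.addGlyph(3, 250f, null);  // no Unicode mapping at all

        for (int code = 1; code <= 3; code++) {
            System.out.println("code=" + code
                    + " unicode=" + font.toUnicode(code)
                    + " tx=" + horizontalAdvance(font, code, 12f, 1f));
        }
    }
}
```

In this sketch, codes 1 and 2 decode to the same text but produce different advances, and code 3 still has a usable width although it has no Unicode mapping. If the real lookup is keyed the same way, a missing or wrong character mapping should not by itself affect the widths used for line reassembly.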

I have to admit, though, that the code is spread across many classes, so I may 
well have overlooked something...

> IllegalArgumentException: Symbolic fonts must have a built-in encoding
> ----------------------------------------------------------------------
>
>                 Key: PDFBOX-3403
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3403
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>    Affects Versions: 2.0.2, 2.0.3, 2.1.0
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>             Fix For: 2.0.3, 2.1.0
>
>         Attachments: PDFBOX-3403-XXX.pdf, PDFBOX-3403-YYY.pdf, PDFBOX-3403.pdf
>
>
> Happens with text extraction and rendering:
> {code}
> Exception in thread "main" java.lang.IllegalArgumentException: Symbolic fonts must have a built-in encoding
>       at org.apache.pdfbox.pdmodel.font.encoding.DictionaryEncoding.<init>(DictionaryEncoding.java:113)
>       at org.apache.pdfbox.pdmodel.font.PDSimpleFont.readEncoding(PDSimpleFont.java:126)
>       at org.apache.pdfbox.pdmodel.font.PDType1CFont.<init>(PDType1CFont.java:131)
>       at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:60)
>       at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:123)
>       at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
>       at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:829)
> {code}


