[
https://issues.apache.org/jira/browse/PDFBOX-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15367595#comment-15367595
]
Michael Klink commented on PDFBOX-3403:
---------------------------------------
{quote}
Text extraction has to reassemble the lines of text and uses the widths of the
characters to do this
{quote}
Are you sure?
As far as I can see, it looks up the widths not by character but by glyph
code:
{code:title=org.apache.pdfbox.contentstream.PDFStreamEngine.showText(byte[])}
// decode a character
int before = in.available();
int code = font.readCode(in);
int codeLength = before - in.available();
String unicode = font.toUnicode(code);
...
// get glyph's horizontal and vertical displacements, in text space
Vector w = font.getDisplacement(code);
...
showGlyph(textRenderingMatrix, font, code, unicode, w);
{code}
and
{code:title=org.apache.pdfbox.text.PDFTextStreamEngine.showGlyph(Matrix, PDFont, int, String, Vector)}
protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode,
                         Vector displacement) throws IOException
...
float displacementX = displacement.getX();
// the sorting algorithm is based on the width of the character. As the displacement
// for vertical characters doesn't provide any suitable value for it, we have to
// calculate our own
if (font.isVertical())
{
    displacementX = font.getWidth(code) / 1000;
    ...
}
// (modified) combined displacement, this is calculated *without* taking the character
// spacing and word spacing into account, due to legacy code in TextStripper
float tx = displacementX * fontSize * horizontalScaling;
float ty = displacement.getY() * fontSize;
{code}
I have to admit, though, that the code is spread across many classes, so I may
have overlooked something.
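To illustrate the point, here is a minimal, self-contained sketch of a width
lookup keyed by glyph code rather than by Unicode character. It does not use
PDFBox: SketchFont and AdvanceSketch are hypothetical stand-ins, and the
arithmetic only mirrors the vertical-font branch and the tx line quoted above.
Two codes that map to the same character get different advances, and a code
without any Unicode mapping still gets one.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a PDF font; this is NOT a PDFBox class. */
final class SketchFont {
    private final Map<Integer, Float> widthsByCode = new HashMap<>();
    private final Map<Integer, String> unicodeByCode = new HashMap<>();

    /** Registers a glyph; unicode may be null if the font has no mapping. */
    void addGlyph(int code, float width, String unicode) {
        widthsByCode.put(code, width);
        if (unicode != null) {
            unicodeByCode.put(code, unicode);
        }
    }

    /** Width in 1/1000 text space units, looked up by glyph code. */
    float getWidth(int code) {
        return widthsByCode.getOrDefault(code, 0f);
    }

    /** Unicode text for the code, or null if there is no mapping. */
    String toUnicode(int code) {
        return unicodeByCode.get(code);
    }
}

final class AdvanceSketch {
    /**
     * Horizontal advance, computed like the quoted lines:
     * displacementX = width / 1000; tx = displacementX * fontSize * horizontalScaling.
     * The Unicode string is never consulted.
     */
    static float horizontalAdvance(SketchFont font, int code,
                                   float fontSize, float horizontalScaling) {
        float displacementX = font.getWidth(code) / 1000;
        return displacementX * fontSize * horizontalScaling;
    }

    public static void main(String[] args) {
        SketchFont font = new SketchFont();
        font.addGlyph(1, 500f, "A");   // code 1 maps to "A"
        font.addGlyph(2, 750f, "A");   // code 2 also maps to "A", wider glyph
        font.addGlyph(3, 250f, null);  // code 3 has no Unicode mapping

        for (int code = 1; code <= 3; code++) {
            System.out.println("code " + code
                    + " unicode=" + font.toUnicode(code)
                    + " tx=" + horizontalAdvance(font, code, 12f, 1f));
        }
    }
}
```

If the real code paths behave like this sketch, a missing or wrong Unicode
mapping would change the extracted text but not the positions used to
reassemble the lines.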
> IllegalArgumentException: Symbolic fonts must have a built-in encoding
> ----------------------------------------------------------------------
>
> Key: PDFBOX-3403
> URL: https://issues.apache.org/jira/browse/PDFBOX-3403
> Project: PDFBox
> Issue Type: Bug
> Components: PDModel
> Affects Versions: 2.0.2, 2.0.3, 2.1.0
> Reporter: Tilman Hausherr
> Assignee: Tilman Hausherr
> Fix For: 2.0.3, 2.1.0
>
> Attachments: PDFBOX-3403-XXX.pdf, PDFBOX-3403-YYY.pdf, PDFBOX-3403.pdf
>
>
> Happens with text extraction and rendering:
> {code}
> Exception in thread "main" java.lang.IllegalArgumentException: Symbolic fonts must have a built-in encoding
>     at org.apache.pdfbox.pdmodel.font.encoding.DictionaryEncoding.<init>(DictionaryEncoding.java:113)
>     at org.apache.pdfbox.pdmodel.font.PDSimpleFont.readEncoding(PDSimpleFont.java:126)
>     at org.apache.pdfbox.pdmodel.font.PDType1CFont.<init>(PDType1CFont.java:131)
>     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:60)
>     at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:123)
>     at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:829)
> {code}