[ https://issues.apache.org/jira/browse/PDFBOX-759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Freuck updated PDFBOX-759:
------------------------------------
Attachment: Mathematik_Stochastik.pdf
The PDF from said website.
> Special characters not extracted
> --------------------------------
>
> Key: PDFBOX-759
> URL: https://issues.apache.org/jira/browse/PDFBOX-759
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.1.0, 1.2.0
> Environment: all
> Reporter: Sebastian Freuck
> Attachments: Mathematik_Stochastik.pdf
>
>
> When trying to extract characters from mathematical formulas, there appear to
> be many characters that don't seem to have any meaning.
> Take, for example, the last formula on page 80, the one with the binomial
> coefficient. When extracted using Foxit Reader or Adobe Reader, the opening
> bracket is a character with the int value 18 and the closing bracket is one
> with the int value 19. When I look at the TextPosition objects using PDFBox,
> however, there is one character to the left of the 5, and it has the glyph
> name spacehackarabic/space and the int value 32.
> The next problem is that there seems to be a character at the same position
> as the 5, a 'controlLF'. What is it doing at the same position as that number?
> Now, after the character 2, there are three more characters: another
> 'controlLF' and two 'spacehackarabic/space'. There is no indication
> whatsoever of the bracket. What do those extra characters mean? And why
> doesn't PDFBox show the character for the bracket that I am able to extract
> using the other PDF readers?
> The PDF can be downloaded from
> http://upload.wikimedia.org/wikibooks/de/f/f6/Mathematik_Stochastik.pdf
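
For anyone trying to reproduce the int values quoted above (18, 19, 32), here is a minimal sketch of how extracted text could be inspected one code point at a time. It uses only the JDK and deliberately makes no PDFBox calls. The class name CodePointDump, its methods, and the sample string are hypothetical and exist only for illustration; the sample is not actual output from the attached PDF.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of PDFBox: lists the numeric value of every
// character in a string that some PDF reader or library has extracted.
class CodePointDump {

    // Describes one code point as "U+XXXX (decimal) kind".
    static String describe(int codePoint) {
        String kind;
        if (Character.isISOControl(codePoint)) {
            kind = "control";            // e.g. 18, 19, or line feed (10)
        } else if (Character.isWhitespace(codePoint)) {
            kind = "whitespace";         // e.g. the plain space, 32
        } else {
            kind = "printable '" + new String(Character.toChars(codePoint)) + "'";
        }
        return String.format(Locale.ROOT, "U+%04X (%d) %s", codePoint, codePoint, kind);
    }

    // Returns one description per code point, in extraction order.
    static List<String> dump(String extracted) {
        List<String> lines = new ArrayList<>();
        extracted.codePoints().forEach(cp -> lines.add(describe(cp)));
        return lines;
    }

    public static void main(String[] args) {
        // Assumed sample: bracket characters 18 and 19 around "5" and "2".
        String sample = "\u0012" + "5" + "2" + "\u0013";
        dump(sample).forEach(System.out::println);
    }
}
```

Feeding the same formula's text from PDFBox and from another reader through dump() would make differences such as a missing 18/19 pair or extra 32s easy to compare line by line.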
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.