[ https://issues.apache.org/jira/browse/PDFBOX-759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler updated PDFBOX-759:
--------------------------------------

    Attachment: PDFBOX759-Mathematik_Stochastik.txt

> Special characters not extracted
> --------------------------------
>
>                 Key: PDFBOX-759
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-759
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.1.0, 1.2.0
>         Environment: all
>            Reporter: Sebastian Freuck
>         Attachments: Mathematik_Stochastik.pdf, PDFBOX759-Mathematik_Stochastik.txt
>
>
> When trying to extract characters for mathematical formulas, there appear to
> be lots of characters that don't seem to have any meaning.
> Take the example on page 80, the last formula with the binomial coefficient.
> The first opening bracket, when extracted using Foxit Reader or Adobe
> Reader, gets a character with the int value 18, and the closing bracket gets
> the int value 19. Now when I look at the TextPosition objects using PDFBox,
> there is one character to the left of the 5, and that one has the glyph name
> spacehackarabic/space and the int value 32.
> The next problem is that there seems to be a character at the same position
> as the 5, a 'controlLF'. What does it do at the same position as that number?
> Now after the character 2 there are 3 other characters: another 'controlLF'
> and two 'spacehackarabic/space'. There is no indication whatsoever about the
> bracket. What do those extra characters mean? And why doesn't it show the
> character for the bracket that I am able to extract using the other PDF
> readers?
> The PDF can be downloaded from
> http://upload.wikimedia.org/wikibooks/de/f/f6/Mathematik_Stochastik.pdf
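
The int values mentioned in the report (18, 19, 32, and a line feed) are easier to compare across extractors when the extracted text is dumped as numeric code points. The following is a minimal sketch, assuming the text has already been obtained as a Java String by whatever extractor is in use. The class name CodePointDump, the method describe, and the sample string are hypothetical illustrations and not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper; not part of PDFBox.
final class CodePointDump {

    // Returns one entry per code point: "<index>: <int value> (<label>)".
    public static List<String> describe(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String label;
            if (Character.isISOControl(cp)) {
                // Covers values such as 10 (line feed), 18 and 19.
                label = "control";
            } else if (Character.isWhitespace(cp)) {
                // Covers 32 (space).
                label = "whitespace";
            } else {
                label = "printable '" + new String(Character.toChars(cp)) + "'";
            }
            out.add(i + ": " + cp + " (" + label + ")");
            i += Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed sample: values 18 and 19 stand in for the two brackets,
        // with "5", a line feed, and "2" between them.
        String sample = String.valueOf((char) 18) + "5\n2" + (char) 19;
        for (String line : describe(sample)) {
            System.out.println(line);
        }
    }
}
```

Running the same dump on the output of two different extractors makes it possible to see which values differ at which position, without relying on how a viewer renders control characters.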