Special characters not extracted
--------------------------------
Key: PDFBOX-759
URL: https://issues.apache.org/jira/browse/PDFBOX-759
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.1.0, 1.2.0
Environment: all
Reporter: Sebastian Freuck
When extracting the characters of mathematical formulas, PDFBox returns many
characters that don't seem to have any meaning.
Take the last formula on page 80, the one with the binomial coefficient. When
the text is extracted with Foxit Reader or Adobe Reader, the opening bracket
comes out as a character with the int value 18 and the closing bracket as a
character with the int value 19. When I look at the TextPosition objects in
PDFBox, however, there is one character to the left of the 5, and it has the
glyph name spacehackarabic/space and the int value 32.
The next problem is that there seems to be a character at the same position as
the 5, a 'controlLF'. What is it doing at the same position as that number?
After the character 2 there are 3 more characters: another 'controlLF' and two
'spacehackarabic/space'. There is no indication whatsoever of the bracket.
What do those extra characters mean? And why doesn't it show the character for
the bracket that I am able to extract using the other PDF readers?
The PDF can be downloaded from
http://upload.wikimedia.org/wikibooks/de/f/f6/Mathematik_Stochastik.pdf
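
To make the comparison between the readers easier to reproduce, here is a
minimal sketch that prints the int value of every character in an extracted
string and labels it as control, space or printable. It deliberately uses only
the JDK and no PDFBox calls. The class name CodePointDump, its methods and the
sample string are hypothetical; the sample is built from the values mentioned
above (18, '5', line feed, '2', 19, 32) and is not real output from the PDF.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: labels the int values found in
// a string that some tool extracted from a PDF.
final class CodePointDump {

    /** Labels a code point as line feed, other control, space or printable. */
    public static String classify(int codePoint) {
        if (codePoint == 0x0A) {
            return "control (LF)";
        }
        if (Character.isISOControl(codePoint)) {
            return "control";
        }
        if (codePoint == 0x20) {
            return "space";
        }
        return "printable";
    }

    /** Returns one "value label" row per code point in the text. */
    public static List<String> dump(String text) {
        List<String> rows = new ArrayList<>();
        text.codePoints().forEach(cp -> rows.add(cp + " " + classify(cp)));
        return rows;
    }

    public static void main(String[] args) {
        // Assumed sample built from the values in this report, not from the PDF.
        int[] values = {18, '5', 10, '2', 19, 32};
        String sample = new String(values, 0, values.length);
        for (String row : dump(sample)) {
            System.out.println(row);
        }
    }
}
```

Running the same dump on the text copied out of Foxit Reader or Adobe Reader
and on the characters reported by PDFBox should show at a glance which int
values (such as 18 and 19) are missing on the PDFBox side.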