Special characters not extracted
--------------------------------

                 Key: PDFBOX-759
                 URL: https://issues.apache.org/jira/browse/PDFBOX-759
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.1.0, 1.2.0
         Environment: all
            Reporter: Sebastian Freuck


When extracting the characters of mathematical formulas, many characters 
appear that don't seem to have any meaning.
Take, for example, the last formula on page 80, the one with the binomial 
coefficient. When extracted using Foxit Reader or Adobe Reader, the first 
opening bracket comes out as a character with the int value 18 and the closing 
bracket as the int value 19. When I look at the TextPosition objects using 
PDFBox instead, there is one character to the left of the 5, and it has the 
glyph name spacehackarabic/space and the int value 32. 
The next problem is that there seems to be a character at the same position as 
the 5, a 'controlLF'. What is it doing at the same position as that number? 
Now, after the character 2 there are 3 other characters: another 'controlLF' 
and two 'spacehackarabic/space'. There is no indication whatsoever of the 
bracket. What do those extra characters mean? And why doesn't PDFBox show the 
character for the bracket that I am able to extract using the other PDF readers?
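
To make the int values above easier to compare between tools, here is a 
minimal sketch of a code point dump. It is not the code used for this report 
and it does not call PDFBox at all: it assumes the text of each TextPosition 
(or the text copied from another reader) has already been obtained as a plain 
Java String. The class name CodePointDump, the method describe and the sample 
string are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointDump {

    /** Returns one line per code point: its int value and a rough classification. */
    static List<String> describe(String extracted) {
        List<String> lines = new ArrayList<>();
        extracted.codePoints().forEach(cp -> {
            String kind;
            if (cp == ' ') {
                kind = "space";
            } else if (Character.isISOControl(cp)) {
                // Covers LF (10) as well as the values 18 and 19 mentioned above.
                kind = "control";
            } else {
                kind = "printable '" + new String(Character.toChars(cp)) + "'";
            }
            lines.add(cp + " " + kind);
        });
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical sample that only mirrors the int values named in the
        // report; it is not the actual output of PDFBox or of any reader.
        String sample = (char) 18 + " 5\n2\n  " + (char) 19;
        describe(sample).forEach(System.out::println);
    }
}
```

Running the same dump on the text from PDFBox and on the text copied from 
another reader should show directly which int values are missing or extra.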

The PDF can be downloaded from 
http://upload.wikimedia.org/wikibooks/de/f/f6/Mathematik_Stochastik.pdf

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.