[
https://issues.apache.org/jira/browse/PDFBOX-759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andreas Lehmkühler resolved PDFBOX-759.
---------------------------------------
Resolution: Fixed
Fix Version/s: 1.4.0
Assignee: Andreas Lehmkühler
I attached the extracted text. It looks good to me. In particular, the mentioned
page 80 looks a lot better than the Adobe Reader copy-and-paste result.
> Special characters not extracted
> --------------------------------
>
> Key: PDFBOX-759
> URL: https://issues.apache.org/jira/browse/PDFBOX-759
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.1.0, 1.2.0
> Environment: all
> Reporter: Sebastian Freuck
> Assignee: Andreas Lehmkühler
> Fix For: 1.4.0
>
> Attachments: Mathematik_Stochastik.pdf,
> PDFBOX759-Mathematik_Stochastik.txt
>
>
> When trying to extract characters from mathematical formulas, there appear to
> be lots of characters that don't seem to have any meaning.
> Take, for example, the last formula on page 80, the one with the binomial
> coefficient.
> The first opening bracket, when extracted using the Foxit Reader or Adobe
> Reader gets a character with the int value 18 and the closing bracket is the
> int value 19. Now when I look at the TextPosition objects using PDFBox, there
> is one character to the left of the 5 and that one has the glyph name
> spacehackarabic/space and the int value 32.
> The next problem is that there seems to be a character at the same position
> as the 5, a 'controlLF'. What does it do at the same position as that number?
> Now after the character 2 there are 3 other characters, another 'controlLF'
> and two 'spacehackarabic/space'. There is no indication whatsoever about the
> bracket. What do those extra characters mean? And why doesn't it show the
> character for the bracket that I am able to extract using the other PDF
> readers?
> The PDF can be downloaded from
> http://upload.wikimedia.org/wikibooks/de/f/f6/Mathematik_Stochastik.pdf
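
For anyone who wants to compare extraction results in the way the report
describes (int value 18 for the opening bracket, 19 for the closing one, 32
for a space), a minimal sketch that prints the int value of every character in
an extracted string may help. It uses only the Java standard library and makes
no PDFBox calls. The class name CodePointDump, the method describe, and the
sample string are hypothetical and only for illustration. The input is assumed
to be text that some extractor has already produced.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointDump {

    /**
     * Returns one entry per code point in the form "index: value label".
     * Control characters (such as 18 and 19) are labelled "<control>"
     * because they have no visible glyph of their own.
     */
    public static List<String> describe(String text) {
        List<String> lines = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            String label = Character.isISOControl(cp)
                    ? "<control>"
                    : "'" + new String(Character.toChars(cp)) + "'";
            lines.add(i + ": " + cp + " " + label);
            // Advance by 1 or 2 chars depending on whether cp is a surrogate pair.
            i += Character.charCount(cp);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical sample: values 18 and 19 stand in for the brackets
        // mentioned in the report; this is not real output from the PDF.
        String sample = (char) 18 + "5 2" + (char) 19;
        for (String line : describe(sample)) {
            System.out.println(line);
        }
    }
}
```

Running the same dump on the text copied from another reader and on the text
produced by the extractor under test should make missing or extra characters
easy to spot side by side.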