Daniel Bonniot de Ruisselet created PDFBOX-1351:
---------------------------------------------------
Summary: False paragraph caused by superscript (1.7 regression)
Key: PDFBOX-1351
URL: https://issues.apache.org/jira/browse/PDFBOX-1351
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.7.0
Reporter: Daniel Bonniot de Ruisselet
With the attached minimal example document, text extraction appears to be
confused by the superscript and generates three paragraphs where there is only one.
Note that 1.6.0 handles this case correctly:
{noformat}
$ java -jar /dev/shm/pdfbox-app-1.6.0.jar ExtractText /tmp/superscript.pdf
Jun 29, 2012 4:52:24 PM org.apache.pdfbox.pdfparser.PDFParser parseObject
WARNING: expected='%%EOF' actual='5 0 obj '
$ cat /tmp/superscript.txt
Multiple synthetic routes have been described by R. Filler et al.11 regarding
1,3-
Bis(perfluorophenyl)propane-1,3-dione. The synthesis and
$ java -jar /dev/shm/pdfbox-app-1.7.0.jar ExtractText /tmp/superscript.pdf
Jun 29, 2012 4:52:39 PM org.apache.pdfbox.pdfparser.PDFParser parseObject
WARNING: expected='%%EOF' actual='5 0 obj '
$ cat /tmp/superscript.txt
Multiple synthetic routes have been described by R. Filler et al.
11
regarding 1,3-
Bis(perfluorophenyl)propane-1,3-dione. The synthesis and
{noformat}
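
For anyone who wants to turn the two outputs above into an automated regression check, here is a minimal sketch. It does not call PDFBox at all: it only inspects already-extracted text and flags lines made up solely of digits, which is how the detached superscript shows up in the 1.7.0 output. The class name SuperscriptSplitCheck and the method digitOnlyLines are hypothetical names chosen for illustration, and the heuristic is an assumption that fits this sample document rather than a general test.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: works on text that was already extracted.
public class SuperscriptSplitCheck {

    /** Returns every line that consists only of digits (a detached superscript). */
    static List<String> digitOnlyLines(String extracted) {
        List<String> hits = new ArrayList<String>();
        for (String line : extracted.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.matches("\\d+")) {
                hits.add(trimmed);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Text copied from the 1.6.0 and 1.7.0 outputs quoted in this report.
        String from16 = "Multiple synthetic routes have been described by R. Filler et al.11 regarding\n"
                + "1,3-\n"
                + "Bis(perfluorophenyl)propane-1,3-dione. The synthesis and\n";
        String from17 = "Multiple synthetic routes have been described by R. Filler et al.\n"
                + "11\n"
                + "regarding 1,3-\n"
                + "Bis(perfluorophenyl)propane-1,3-dione. The synthesis and\n";

        System.out.println("1.6.0: " + digitOnlyLines(from16)); // prints "1.6.0: []"
        System.out.println("1.7.0: " + digitOnlyLines(from17)); // prints "1.7.0: [11]"
    }
}
```

A line such as "1,3-" is deliberately not flagged, since it contains non-digit characters and is present in the 1.6.0 output as well.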