Daniel Bonniot de Ruisselet created PDFBOX-1351:
---------------------------------------------------

             Summary: False paragraph caused by superscript (1.7 regression)
                 Key: PDFBOX-1351
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1351
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.7.0
            Reporter: Daniel Bonniot de Ruisselet


In the attached minimal example document, text extraction appears to be confused 
by the superscript and generates three paragraphs where there is only one.

Note that 1.6.0 handles this case correctly:

{noformat}
$ java -jar /dev/shm/pdfbox-app-1.6.0.jar ExtractText /tmp/superscript.pdf
Jun 29, 2012 4:52:24 PM org.apache.pdfbox.pdfparser.PDFParser parseObject
WARNING: expected='%%EOF' actual='5 0 obj '
$ cat /tmp/superscript.txt 
  
Multiple synthetic routes have been described by R. Filler et al.11 regarding 
1,3-
Bis(perfluorophenyl)propane-1,3-dione.  The synthesis and 
 
 
$ java -jar /dev/shm/pdfbox-app-1.7.0.jar ExtractText /tmp/superscript.pdf 
Jun 29, 2012 4:52:39 PM org.apache.pdfbox.pdfparser.PDFParser parseObject
WARNING: expected='%%EOF' actual='5 0 obj '
$ cat /tmp/superscript.txt                                                 
  
Multiple synthetic routes have been described by R. Filler et al.
11
 regarding 1,3-
Bis(perfluorophenyl)propane-1,3-dione.  The synthesis and 
 
 
{noformat}
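
The difference between the two outputs above could be checked automatically. The following is only a sketch, not part of PDFBox: it does not call the library at all and works on text that has already been extracted. The class and method names (SuperscriptLineCheck, nonBlankLines, hasIsolatedNumberLine) are hypothetical, and the heuristic that a line made up solely of digits is a detached superscript is an assumption that fits this sample, not a general rule.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of PDFBox.
public class SuperscriptLineCheck {

    // Returns the trimmed, non-blank lines of the extracted text.
    static List<String> nonBlankLines(String extracted) {
        List<String> result = new ArrayList<String>();
        for (String line : extracted.split("\r?\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    // Assumption: a line consisting only of digits is a superscript
    // that was split off from the surrounding sentence.
    static boolean hasIsolatedNumberLine(String extracted) {
        for (String line : nonBlankLines(extracted)) {
            if (line.matches("\\d+")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Text copied from the 1.6.0 and 1.7.0 outputs in this report.
        String from16 = "Multiple synthetic routes have been described by R. Filler et al.11 regarding \n"
                + "1,3-\n"
                + "Bis(perfluorophenyl)propane-1,3-dione.  The synthesis and \n";
        String from17 = "Multiple synthetic routes have been described by R. Filler et al.\n"
                + "11\n"
                + " regarding 1,3-\n"
                + "Bis(perfluorophenyl)propane-1,3-dione.  The synthesis and \n";

        System.out.println("1.6.0 isolated number line: " + hasIsolatedNumberLine(from16));
        System.out.println("1.7.0 isolated number line: " + hasIsolatedNumberLine(from17));
        System.out.println("1.6.0 non-blank lines: " + nonBlankLines(from16).size());
        System.out.println("1.7.0 non-blank lines: " + nonBlankLines(from17).size());
    }
}
```

A check of this kind only flags the symptom shown in this report; it says nothing about where in the text extraction code the extra line breaks are introduced.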



        
