[jira] Created: (PDFBOX-729) Text extracted from a TeX-created PDF file is unintelligible, but not of the form a1a2a3...

Thomas Fischer (JIRA) Sun, 16 May 2010 05:29:12 -0700

Text extracted from a TeX-created PDF file is unintelligible, but not of the 
form a1a2a3...
-------------------------------------------------------------------------------------------


                 Key: PDFBOX-729
                 URL: https://issues.apache.org/jira/browse/PDFBOX-729
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.1.0
         Environment: Mac OS X 10.6.3, using org.apache.pdfbox.ExtractText -encoding UTF-8
            Reporter: Thomas Fischer


Text extracted from some PDF files is completely unintelligible, presumably 
depending on the software used to create the file. In this example, the 
PostScript was created with dvips(k) 5.95a (Copyright 2005 Radical Eye 
Software) and converted to PDF with Acrobat Distiller 8.1.0 (Windows). The 
extracted text looks like this:

CFCTCXCTD6D7D8D6CPH3B9C1D2D7D8CXD8D9D8
CUH0D6 BTD2CVCTDBCPD2CSD8CT BTD2CPD0DDD7CXD7 D9D2CS CBD8D3CRCWCPD7D8CXCZ
CXD1 BYD3D6D7CRCWD9D2CVD7DACTD6CQD9D2CS BUCTD6D0CXD2 CTBACEBA
C

Only rarely are bits and pieces of recognisable formulas interspersed.

The text copied using either Acrobat Reader or Preview looks different, but is 
similarly unintelligible.
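
A possibly useful observation for triage: the extracted sample looks 
consistent with each character being written as two base-36 digits (0-9, A-Z) 
whose value is the character code plus 360. This is inferred from the sample 
alone and is not a confirmed description of how the file or PDFBox encodes the 
text. The class name, method name and the offset constant below are 
hypothetical, chosen only for this illustration. A minimal sketch under that 
assumption:

```java
public class Base36PairDecoder {
    // Assumption inferred from the sample only: pair value = character code + 360.
    private static final int OFFSET = 360;

    public static String decode(String garbled) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < garbled.length()) {
            int hi = Character.digit(garbled.charAt(i), 36);
            int lo = (i + 1 < garbled.length())
                    ? Character.digit(garbled.charAt(i + 1), 36)
                    : -1;
            int value = (hi >= 0 && lo >= 0) ? hi * 36 + lo : -1;
            if (value >= OFFSET) {
                // Two base-36 digits form one character.
                out.append((char) (value - OFFSET));
                i += 2;
            } else {
                // Spaces, unpaired characters and out-of-range pairs pass through.
                out.append(garbled.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Second and third words of the second sample line.
        System.out.println(decode("BTD2CVCTDBCPD2CSD8CT BTD2CPD0DDD7CXD7"));
        // prints "Angewandte Analysis"
    }
}
```

Under this assumption the ASCII letters in the sample decode to readable 
German words. Codes above 127 would depend on the font's own encoding and are 
not given any special handling here.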

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
