Text extracted from a TeX-created PDF file is unintelligible, but not of the
form a1a2a3...
-------------------------------------------------------------------------------------------
Key: PDFBOX-729
URL: https://issues.apache.org/jira/browse/PDFBOX-729
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.1.0
Environment: Mac OS X 10.6.3, using org.apache.pdfbox.ExtractText
-encoding UTF-8
Reporter: Thomas Fischer
Text extracted from some PDF files is completely unintelligible; this
presumably depends on the software used to create the file. The example file
was produced with dvips(k) 5.95a, Copyright 2005 Radical Eye Software (to
create the PostScript) and Acrobat Distiller 8.1.0 (Windows) (to create the
PDF file). The extracted text looks like this:
CFCTCXCTD6D7D8D6CPH3B9C1D2D7D8CXD8D9D8
CUH0D6 BTD2CVCTDBCPD2CSD8CT BTD2CPD0DDD7CXD7 D9D2CS CBD8D3CRCWCPD7D8CXCZ
CXD1 BYD3D6D7CRCWD9D2CVD7DACTD6CQD9D2CS BUCTD6D0CXD2 CTBACEBA
C
Only occasionally are recognisable fragments of formulas interspersed.
Text copied from the same file with Acrobat Reader or Preview looks different,
but is similarly unintelligible.
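
A possible reading of the sample above, offered as an observation rather than a confirmed diagnosis: each two-character token appears to encode a character code as 36 * (first letter - 'A') + (base-36 value of the second character). The sketch below is not part of PDFBox; `decode_pair` and `decode_line` are hypothetical helper names. Treating code 255 as "ß" assumes the TeX T1 (Cork) font encoding, which differs from Latin-1 at that position. Why the producing tool chain emits such tokens is not established here.

```python
import string

# Assumed digit alphabet for the second character: 0-9, then A-Z (base 36).
DIGITS = string.digits + string.ascii_uppercase

# Assumption: the font uses the TeX T1 (Cork) encoding, where code 255 is
# "ß" (in Latin-1 it would be "ÿ").
OVERRIDES = {255: "ß"}


def decode_pair(pair):
    """Decode one two-character token under the assumed scheme."""
    code = 36 * (ord(pair[0]) - ord("A")) + DIGITS.index(pair[1])
    return OVERRIDES.get(code, chr(code))


def decode_line(line):
    """Decode a line of space-separated words made of two-character tokens.

    Words that do not fit the assumed scheme are returned unchanged.
    """
    decoded = []
    for word in line.split(" "):
        fits = (
            len(word) % 2 == 0
            and all(c in string.ascii_uppercase for c in word[0::2])
            and all(c in DIGITS for c in word[1::2])
        )
        if fits:
            pairs = [word[i:i + 2] for i in range(0, len(word), 2)]
            decoded.append("".join(decode_pair(p) for p in pairs))
        else:
            decoded.append(word)
    return " ".join(decoded)


if __name__ == "__main__":
    for sample in (
        "CFCTCXCTD6D7D8D6CPH3B9C1D2D7D8CXD8D9D8",
        "CUH0D6 BTD2CVCTDBCPD2CSD8CT BTD2CPD0DDD7CXD7 D9D2CS CBD8D3CRCWCPD7D8CXCZ",
        "CXD1 BYD3D6D7CRCWD9D2CVD7DACTD6CQD9D2CS BUCTD6D0CXD2 CTBACEBA",
    ):
        print(decode_line(sample))
```

Under these assumptions the three sample lines decode to "Weierstraß-Institut", "für Angewandte Analysis und Stochastik" and "im Forschungsverbund Berlin e.V.", which suggests the extracted tokens may be per-glyph names or codes rather than random noise.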