Dear Wiki user, You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "PDFParser (Apache PDFBox)" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29?action=diff&rev1=10&rev2=11

  If you are using Java 8, make sure to see [[https://pdfbox.apache.org/2.0/migration.html#pdf-rendering|pdf-rendering]] for JVM settings that may improve the speed of processing.
+
+ == Common Text Extraction Challenges with PDFs ==
+
+ This is mostly a stub. The focus of this section is on extracting electronic text from the PDF with no OCR.
+
+ One could write several volumes on how text extraction from PDFs could go wrong. It would only be poetic justice for said author to print out those volumes, pour coffee on the paper, scan them in as PDFs on different scanners, some with OCR, some without, at different angles of rotation, with user-defined fonts randomly deleted.
+
+ High-level preliminaries:
+
+  0. Your matrix algebra (or your tool's matrix algebra) has to be moderately advanced to do text extraction well.
+
+  1. The PDF format is display-based, not text-based.
+   a. One major goal is to display the same content on different devices.
+   b. A PDF may be image-only and contain no actual electronic text.
+   c. When there is electronic text, there may be no space characters stored in the text; instead, spaces may appear in the rendered page only through the specific coordinates of the characters.
+
+  2. The PDF format is page-based.
+
+ === No Text ===
+
+ === Mildly Garbled Text ===
+
+ === Completely Garbled Text ===
+
+ === No spaces/Extra spaces ===
+
+ === Word/Line breaks in the middle of my text ?! ===
+
+ === Character Encoding/Unicode Mappings ===
+
+
+ See also [[https://wiki.apache.org/tika/Troubleshooting%20Tika#PDF_Text_Problems|diagnosing PDF text problems]].
+
  == See also ==
  Upgrading to [[PDFBOX_2_X_NOTES|PDFBox 2.x]]
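
To make preliminary 1c concrete (spaces that exist only as gaps between character coordinates), here is a minimal Python sketch of gap-based space inference. It is an illustration only and not how PDFBox or Tika implement it: the `Glyph` type, the `join_glyphs` function, and the `gap_ratio` threshold are all hypothetical names and values chosen for this example, and it assumes a single horizontal, left-to-right line with no rotation.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One positioned character on a page (hypothetical type for this sketch)."""
    char: str
    x: float      # left edge of the glyph, in page units
    width: float  # horizontal advance of the glyph, in page units


def join_glyphs(glyphs, gap_ratio=0.3):
    """Rebuild one line of text, inserting a space wherever the horizontal
    gap between two neighbouring glyphs exceeds gap_ratio * average width.

    gap_ratio is an assumed tuning knob, not a value taken from any library.
    """
    ordered = sorted(glyphs, key=lambda g: g.x)
    if not ordered:
        return ""
    avg_width = sum(g.width for g in ordered) / len(ordered)
    out = [ordered[0].char]
    for prev, cur in zip(ordered, ordered[1:]):
        # Distance from the right edge of the previous glyph to the left
        # edge of the current one.
        gap = cur.x - (prev.x + prev.width)
        if gap > gap_ratio * avg_width:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


# No space character is stored; only the jump from x=18 to x=22 hints at one.
line = [
    Glyph("P", 0, 6), Glyph("D", 6, 6), Glyph("F", 12, 6),
    Glyph("t", 22, 4), Glyph("e", 26, 5), Glyph("x", 31, 5), Glyph("t", 36, 4),
]
print(join_glyphs(line))
```

A threshold that is too low tends to produce extra spaces inside words, and one that is too high tends to drop real spaces, which is one way the "No spaces/Extra spaces" symptoms can arise.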
