Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.

The "PDFParser (Apache PDFBox)" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29?action=diff&rev1=10&rev2=11

  
  If you are using Java 8, make sure to see 
[[https://pdfbox.apache.org/2.0/migration.html#pdf-rendering|pdf-rendering]] 
for JVM settings that may improve the speed of processing.
  
+ 
+ == Common Text Extraction Challenges with PDFs ==
+ 
+ This section is mostly a stub. It focuses on extracting electronic text 
from PDFs without OCR.
+ 
+ One could write several volumes on how text extraction from PDFs could go 
wrong.  It would only be poetic justice for said author to print out those 
volumes, pour coffee on the paper, scan them in as PDFs on different scanners, 
some with OCR, some without, at different angles of rotation with user-defined 
fonts randomly deleted.
+ 
+ High level preliminaries:
+ 
+  0. Your matrix algebra (or, your tool's matrix algebra) has to be moderately 
advanced to do text extraction well.
+ 
+  1. The PDF format is display-based, not text-based
+    a. One major goal is to display the same content on different devices
+    b. A PDF may be image-only and contain no actual electronic text
+    c. When there is electronic text, there may be no space characters stored 
in the text; instead, spaces may appear only in the rendering, through the 
specific coordinates at which the characters are placed.
+ 
+  2. The PDF format is page-based
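+ 
+ As a rough illustration of points 0 and 1c, the sketch below maps glyph 
positions through a PDF-style transformation matrix and then inserts a space 
wherever the horizontal gap between neighbouring glyphs is large. Everything 
in it is an assumption made for illustration: the function names, the glyph 
tuples, the gap_ratio threshold, and the restriction to a single 
left-to-right line with a scale-and-translate matrix. It is not how PDFBox 
or Tika actually implements extraction.

```python
from typing import List, Tuple

# PDF-style affine matrix [a b c d e f]:
#   x' = a*x + c*y + e
#   y' = b*x + d*y + f
Matrix = Tuple[float, float, float, float, float, float]

# Hypothetical glyph record: (character, x, y, advance width) in text space.
Glyph = Tuple[str, float, float, float]


def apply_matrix(m: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Map a point through the affine matrix m."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def glyphs_to_text(glyphs: List[Glyph], matrix: Matrix,
                   gap_ratio: float = 0.3) -> str:
    """Rebuild one line of text, guessing spaces from horizontal gaps.

    Assumes a single left-to-right line and a matrix without rotation,
    so each glyph's transformed width is positive.
    """
    placed = []
    for ch, x, y, width in glyphs:
        x0, _ = apply_matrix(matrix, x, y)
        x1, _ = apply_matrix(matrix, x + width, y)
        placed.append((x0, x1, ch))

    # Glyphs are not guaranteed to be stored in reading order.
    placed.sort()

    out = []
    prev_end = None
    for x0, x1, ch in placed:
        # Made-up heuristic: a gap wider than gap_ratio times this
        # glyph's width is treated as a word break.
        if prev_end is not None and x0 - prev_end > gap_ratio * (x1 - x0):
            out.append(" ")
        out.append(ch)
        prev_end = x1
    return "".join(out)


# No space character is stored; the gap between "F" and "b" implies one.
glyphs = [
    ("b", 24.0, 0.0, 6.0), ("o", 30.0, 0.0, 6.0), ("x", 36.0, 0.0, 6.0),
    ("P", 0.0, 0.0, 6.0), ("D", 6.0, 0.0, 6.0), ("F", 12.0, 0.0, 6.0),
]
matrix = (2.0, 0.0, 0.0, 2.0, 50.0, 700.0)  # scale by 2, then translate
line = glyphs_to_text(glyphs, matrix)
```

+ A threshold that is too low produces extra spaces and one that is too high 
drops them, which is one way the "No spaces/Extra spaces" symptom can arise.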
+ 
+ === No Text ===
+ 
+ === Mildly Garbled Text ===
+ 
+ === Completely Garbled Text ===
+ 
+ === No spaces/Extra spaces ===
+ 
+ === Word/Line breaks in the middle of my text?! ===
+ 
+ === Character Encoding/Unicode Mappings ===
+ 
+ 
+ See also 
[[https://wiki.apache.org/tika/Troubleshooting%20Tika#PDF_Text_Problems|diagnosing
 PDF text problems]].
+ 
  == See also ==
  
  Upgrading to [[PDFBOX_2_X_NOTES|PDFBox 2.x]]
