[Tika Wiki] Update of "PDFParser (Apache PDFBox)" by TimothyAllison

Apache Wiki Tue, 24 Jan 2017 12:51:59 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.


The "PDFParser (Apache PDFBox)" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29?action=diff&rev1=12&rev2=13

  
   2. The PDF format is page-based
  
+ === Tables Aren't Extracted as Tables ===
+ Right.  Tables aren't stored as tables in PDF files.  A human can easily 
see a table on the page, but all that the PDF stores is text chunks and their 
coordinates on a page (if there's any text at all).  Recovering table structure 
means inferring rows and columns from those coordinates, which takes some 
advanced computation.  Tika does not currently do this.  Please see 
[[http://tabula.technology/|TabulaPDF]] for one open source project that 
extracts tables from PDFs and maintains their structure.
+ 
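To make the "text chunks and coordinates" point concrete, here is a minimal sketch of the simplest possible row inference: chunks whose vertical positions nearly match are treated as one row, and each row is then ordered left to right. This is not Tika, PDFBox, or Tabula code. The `TableRowSketch` class, the `Chunk` type, the `groupIntoRows` method, and the `yTolerance` parameter are all hypothetical names made up for illustration, and the sketch assumes y increases down the page. Real tables with wrapped cells, merged cells, or missing values need far more than this.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: none of these names come from Tika, PDFBox, or Tabula.
final class TableRowSketch {

    // A text chunk plus the page coordinates where it starts (illustrative type).
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Groups chunks into rows. A chunk joins the current row when its y value is
    // within yTolerance of the first chunk of that row; otherwise it starts a new
    // row. Assumes y increases down the page, so rows come out top to bottom.
    static List<List<String>> groupIntoRows(List<Chunk> chunks, double yTolerance) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y));

        List<List<Chunk>> rows = new ArrayList<>();
        for (Chunk c : sorted) {
            List<Chunk> last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            if (last != null && Math.abs(last.get(0).y - c.y) <= yTolerance) {
                last.add(c);
            } else {
                List<Chunk> row = new ArrayList<>();
                row.add(c);
                rows.add(row);
            }
        }

        // Order each row left to right and keep only the text of each cell.
        List<List<String>> result = new ArrayList<>();
        for (List<Chunk> row : rows) {
            row.sort(Comparator.comparingDouble((Chunk c) -> c.x));
            List<String> cells = new ArrayList<>();
            for (Chunk c : row) {
                cells.add(c.text);
            }
            result.add(cells);
        }
        return result;
    }
}
```

Even this toy version depends on a hand-picked tolerance, which hints at why table extraction is left to dedicated tools.
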
  === No Text ===
  
  === Mildly Garbled Text ===
