Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "PDFParser (Apache PDFBox)" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29?action=diff&rev1=12&rev2=13

  2. The PDF format is page-based

+ === Tables Aren't Extracted as Tables ===
+ Right. Tables aren't stored as tables in PDF files. A human is easily able to see tables, but all that is stored in the PDF is text chunks and coordinates on a page (if there's any text at all). One needs to apply some advanced computation to extract table structure from a PDF. Tika does not currently do this. Please see [[http://tabula.technology/|TabulaPDF]] as one open source project that extracts tables from PDFs and maintains their structure.
+
  === No Text ===
  === Mildly Garbled Text ===
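
To illustrate the point in the "Tables Aren't Extracted as Tables" entry, here is a minimal sketch of what "text chunks and coordinates" means in practice and why recovering rows takes extra computation. It is not how Tika, PDFBox, or Tabula work internally. The `Chunk` type, the `group_into_rows` helper, the `y_tolerance` parameter, and the sample data are all hypothetical, and the approach (clustering chunks by baseline) is only a naive heuristic that assumes a simple, unrotated, single-column table.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for one positioned piece of text on a PDF page."""
    text: str
    x: float  # left edge, in points
    y: float  # baseline, in points (PDF origin is bottom-left, so larger y is higher)


def group_into_rows(chunks, y_tolerance=2.0):
    """Naively rebuild rows: chunks whose baseline is within y_tolerance
    of the first chunk in the current row are treated as the same row."""
    rows = []
    # Walk the page top to bottom, then left to right.
    for chunk in sorted(chunks, key=lambda c: (-c.y, c.x)):
        if rows and abs(rows[-1][0].y - chunk.y) <= y_tolerance:
            rows[-1].append(chunk)
        else:
            rows.append([chunk])
    # Within each row, order the cells by their horizontal position.
    return [[c.text for c in sorted(row, key=lambda c: c.x)] for row in rows]


# Made-up chunks in arbitrary order, as a PDF might store them.
chunks = [
    Chunk("Name", 72, 700), Chunk("Qty", 200, 700),
    Chunk("3", 200, 685.5), Chunk("Apples", 72, 686),
    Chunk("Pears", 72, 672), Chunk("12", 200, 672),
]
table = group_into_rows(chunks)
for row in table:
    print(row)
```

Even this toy version has to guess a tolerance for "same row", and it does nothing about merged cells, wrapped cell text, or column boundaries, which is part of why table extraction is left to dedicated tools.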
