The "ComparisonTikaAndPDFToText201811" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/ComparisonTikaAndPDFToText201811?action=diff&rev1=9&rev2=10

  The tika-eval reports and the full H2 database of comparison results are available here: [[http://162.242.228.174/pdf_parsing/pdftotextVPDFBox_201811/]]

  = Tools and Data =
+  * operating system: Linux cloud-server-02 3.10.0-327.10.1.el7.x86_64 #1 SMP Sat Jan 23 04:54:55 EST 2016 x86_64 x86_64 x86_64 GNU/Linux. We relied on the default system fonts. We made no modifications to the default OS and did not install any fonts.
+  * pdftotext -- we downloaded the most recent available binaries, version 4.00.01, and followed the directions to install all language modules (see [[https://wiki.apache.org/tika/VirtualMachine#pdftotext|virtual machine]]). We wrote a simple Groovy wrapper that starts a new pdftotext process for every file. If pdftotext generated no extract file, the Groovy script generated a 0-byte file. We also forced a timeout after 300 seconds (5 minutes).
   * Tika/PDFBox -- we used a snapshot version of Tika 1.20, which uses PDFBox 2.0.12. We used the default settings and did not sort by position, etc. We did enable permissions checking so that text was not extracted from PDF files that do not allow text extraction.
   * Tika identified 528,618 PDF files in the new pull from Common Crawl. Many of these files are truncated, and 6,787 caused permission exceptions (these files are either encrypted or do not allow extraction of text).
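
The per-file wrapper behaviour described for pdftotext above (one new process per file, a 300-second timeout, and a 0-byte file when no extract is produced) could look roughly like the sketch below. This is not the Groovy script used for the comparison; it is a Python illustration, and the function name `extract_text`, the `cmd` parameter, and the returned status strings are assumptions made for the example.

```python
import subprocess
from pathlib import Path

TIMEOUT_SECONDS = 300  # the 5-minute limit stated in the text


def extract_text(pdf_path, out_path, cmd="pdftotext", timeout=TIMEOUT_SECONDS):
    """Run one extractor process for one PDF.

    Returns "ok", "error", or "timeout" (hypothetical labels). Whatever
    happens, out_path exists afterwards, as a 0-byte file if the
    extractor wrote nothing.
    """
    pdf_path, out_path = Path(pdf_path), Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    status = "ok"
    try:
        # A fresh process per file, so one bad PDF cannot affect the next.
        result = subprocess.run(
            [cmd, str(pdf_path), str(out_path)],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,  # the child is killed when this expires
        )
        if result.returncode != 0:
            status = "error"
    except subprocess.TimeoutExpired:
        status = "timeout"
    except OSError:
        status = "error"  # for example, the binary could not be started
    if not out_path.exists():
        out_path.touch()  # placeholder 0-byte extract
    return status
```

A caller would loop over the PDF files and invoke `extract_text(pdf, extract_dir / (pdf.name + ".txt"))` for each one. The 0-byte placeholder keeps a one-to-one mapping between input files and extract files, which makes failed extractions visible in a later comparison instead of silently missing.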