[Tika Wiki] Update of "ComparisonTikaAndPDFToText201811" by TimothyAllison

Apache Wiki Fri, 30 Nov 2018 10:24:32 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.


The "ComparisonTikaAndPDFToText201811" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/ComparisonTikaAndPDFToText201811?action=diff&rev1=9&rev2=10

  The tika-eval reports and the full H2 database of comparison results are 
available here: [[http://162.242.228.174/pdf_parsing/pdftotextVPDFBox_201811/]]
  
  = Tools and Data =
+  * operating system: Linux cloud-server-02 3.10.0-327.10.1.el7.x86_64 #1 SMP 
Sat Jan 23 04:54:55 EST 2016 x86_64 x86_64 x86_64 GNU/Linux. We relied on the 
default system fonts and made no modifications to the default OS or its 
installed fonts.
+ 
   * pdftotext -- we downloaded the most recent available binaries, version 
4.00.01, and we followed the directions to install all language modules (see 
[[https://wiki.apache.org/tika/VirtualMachine#pdftotext|virtual machine]]). We 
wrote a simple Groovy wrapper that starts a new pdftotext process for every 
file. If pdftotext generated no extract file, the Groovy script generated a 
0-byte file. We also forced a timeout after 300 seconds (5 minutes).
   * Tika/PDFBox -- we used a snapshot version of Tika 1.20, which uses PDFBox 
2.0.12.  We used the default settings and did not sort by position, etc.  We 
did enable permissions checking so that text was not extracted from PDF files 
that did not allow text extraction.
   * Tika identified 528,618 PDF files in the new pull from Common Crawl. Many 
of these files are truncated, and 6,787 caused permission exceptions (these are 
either encrypted or they do not allow extraction of text).
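
The per-file wrapper described in the pdftotext item above can be sketched roughly as follows. This is not the Groovy script used for the comparison; it is a minimal Python illustration of the same three rules (one new process per file, a 0-byte file when no extract is produced, a 300-second timeout). The function name `extract_text`, the `command` parameter, and the returned status strings are hypothetical and were introduced only for this sketch. It also assumes that `pdftotext` is on the PATH and is called as `pdftotext <pdf> <txt>`.

```python
import subprocess
from pathlib import Path


def extract_text(pdf_path, txt_path, timeout_seconds=300, command=("pdftotext",)):
    """Run one extractor process for one file and always leave an extract file.

    `command` is a hypothetical hook so another executable can be swapped in;
    the default assumes `pdftotext` is on the PATH.
    """
    pdf_path, txt_path = Path(pdf_path), Path(txt_path)
    status = "ok"
    try:
        # A fresh process per file, so one bad PDF cannot affect the next one.
        subprocess.run(
            [*command, str(pdf_path), str(txt_path)],
            timeout=timeout_seconds,  # the child is killed when this expires
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,  # a non-zero exit code is not treated as an error here
        )
    except subprocess.TimeoutExpired:
        status = "timeout"
    except OSError:
        # The executable could not be started at all (for example, not installed).
        status = "failed_to_start"

    if not txt_path.exists():
        # No extract was generated: write a 0-byte file as a placeholder.
        txt_path.touch()
        if status == "ok":
            status = "no_output"
    return status
```

A call such as `extract_text("in/doc.pdf", "out/doc.txt")` would then be issued once per input file. Writing the 0-byte placeholder keeps the set of extract files aligned one-to-one with the set of inputs, which makes it easier to compare the two tools file by file. What the original script did with a partial extract left behind after a timeout is not stated on this page; the sketch simply leaves such a file in place.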
