Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.

The "ComparisonTikaAndPDFToText201811" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/ComparisonTikaAndPDFToText201811?action=diff&rev1=21&rev2=22

  The first language is the one identified on the pdftotext extract, and the 
second is the one identified on the Tika/PDFBox extract.  For example, 
'en->fa' means that language identification returned 'en' on the pdftotext 
extract but 'fa' on the Tika/PDFBox extract.
  
  ||Language id||Number of Files||
- || en->en || 143784 ||
+ || en->en || 143,784 ||
- || ru->ru || 44460 ||
+ || ru->ru || 44,460 ||
- || fr->fr || 38872 ||
+ || fr->fr || 38,872 ||
- || it->it || 36433 ||
+ || it->it || 36,433 ||
- || de->de || 30151 ||
+ || de->de || 30,151 ||
- || es->es || 18335 ||
+ || es->es || 18,335 ||
- || ja->ja || 16106 ||
+ || ja->ja || 16,106 ||
- || el->el || 9761 ||
+ || el->el || 9,761 ||
- || fa->fa || 8486 ||
+ || fa->fa || 8,486 ||
- || ko->ko || 8213 ||
+ || ko->ko || 8,213 ||
- || zh-cn->zh-cn || 5815 ||
+ || zh-cn->zh-cn || 5,815 ||
- || tr->tr || 5477 ||
+ || tr->tr || 5,477 ||
- || null || 3132 ||
+ || null || 3,132 ||
- || vi->vi || 2981 ||
+ || vi->vi || 2,981 ||
- || he->he || 2280 ||
+ || he->he || 2,280 ||
- || ar->ar || 2087 ||
+ || ar->ar || 2,087 ||
- || ca->ca || 1275 ||
+ || ca->ca || 1,275 ||
- || en->fa || 1240 ||
+ || en->fa || 1,240 ||
- || pt->pt || 1105 ||
+ || pt->pt || 1,105 ||
  || de->en || 860 ||
  
  The following table shows the top 10 language id pairs for which the 
language id differs between the two extracts.
  
  ||Language ids||Number of Files||
- || en->fa || 1240 ||
+ || en->fa || 1,240 ||
  || de->en || 860 ||
  || en->de || 519 ||
  || en->bn || 392 ||
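
  The pair counts in the tables above could be tallied along the following lines. This is only a minimal sketch, not the code actually used for the comparison: it assumes the per-file language ids are already available as (pdftotext, PDFBox) tuples, and the names count_language_pairs, top_mismatches and wiki_row are hypothetical.

```python
from collections import Counter


def count_language_pairs(detections):
    """Count 'a->b' labels, where a is the language id on the pdftotext
    extract and b is the language id on the Tika/PDFBox extract.

    `detections` is assumed to be an iterable of (a, b) string tuples,
    one per file.
    """
    return Counter(f"{a}->{b}" for a, b in detections)


def top_mismatches(detections, n=10):
    """Return the n most common pairs whose two language ids differ."""
    mismatches = Counter(f"{a}->{b}" for a, b in detections if a != b)
    return mismatches.most_common(n)


def wiki_row(label, count):
    """Format one table row in the wiki's '||' syntax, with a thousands
    separator in the count."""
    return f"|| {label} || {count:,} ||"


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [("en", "en"), ("en", "en"), ("en", "fa"), ("de", "en"), ("en", "fa")]
    for label, count in top_mismatches(sample):
        print(wiki_row(label, count))
```

  Keeping the counts as integers and applying the thousands separator only when a row is formatted means the same tally can feed both tables.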
