Dear Wiki user, You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "ComparisonTikaAndPDFToText201811" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/ComparisonTikaAndPDFToText201811?action=diff&rev1=21&rev2=22

  The first language is that identified in the extract from pdftotext, and the second is the language identified on the extract of PDFBox. For example 'en->fa' means that language id returned 'en' on the pdftotext extract, but 'fa' on the Tika/PDFBox extract.

  ||Language id||Number of Files||
- || en->en || 143784 ||
+ || en->en || 143,784 ||
- || ru->ru || 44460 ||
+ || ru->ru || 44,460 ||
- || fr->fr || 38872 ||
+ || fr->fr || 38,872 ||
- || it->it || 36433 ||
+ || it->it || 36,433 ||
- || de->de || 30151 ||
+ || de->de || 30,151 ||
- || es->es || 18335 ||
+ || es->es || 18,335 ||
- || ja->ja || 16106 ||
+ || ja->ja || 16,106 ||
- || el->el || 9761 ||
+ || el->el || 9,761 ||
- || fa->fa || 8486 ||
+ || fa->fa || 8,486 ||
- || ko->ko || 8213 ||
+ || ko->ko || 8,213 ||
- || zh-cn->zh-cn || 5815 ||
+ || zh-cn->zh-cn || 5,815 ||
- || tr->tr || 5477 ||
+ || tr->tr || 5,477 ||
- || null || 3132 ||
+ || null || 3,132 ||
- || vi->vi || 2981 ||
+ || vi->vi || 2,981 ||
- || he->he || 2280 ||
+ || he->he || 2,280 ||
- || ar->ar || 2087 ||
+ || ar->ar || 2,087 ||
- || ca->ca || 1275 ||
+ || ca->ca || 1,275 ||
- || en->fa || 1240 ||
+ || en->fa || 1,240 ||
- || pt->pt || 1105 ||
+ || pt->pt || 1,105 ||
  || de->en || 860 ||

  In the following, we show the top 10 language id pairs, where the language id differs between the extracts.

  ||Language ids||Number of Files||
- || en->fa || 1240 ||
+ || en->fa || 1,240 ||
  || de->en || 860 ||
  || en->de || 519 ||
  || en->bn || 392 ||
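
The pair counts shown in this diff could be tallied along the following lines. This is only a minimal sketch, not the code used to build the wiki page: it assumes the language id results are already available as (pdftotext_lang, tika_lang) tuples, and the function names tally_pairs, top_differing and wiki_row are hypothetical. The sample data is made up for illustration and is not taken from the corpus.

```python
from collections import Counter


def tally_pairs(lang_pairs):
    """Count how often each (pdftotext_lang, tika_lang) tuple occurs."""
    return Counter(lang_pairs)


def top_differing(counts, top_n=10):
    """Return the top_n most frequent pairs whose two language ids differ."""
    differing = Counter(
        {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
    )
    return differing.most_common(top_n)


def wiki_row(pair, n):
    """Format one table row with a thousands separator in the count."""
    return f"|| {pair[0]}->{pair[1]} || {n:,} ||"


# Illustrative input only (not real corpus data).
sample = (
    [("en", "en")] * 5
    + [("ru", "ru")] * 4
    + [("en", "fa")] * 3
    + [("de", "en")] * 2
)

counts = tally_pairs(sample)
for pair, n in top_differing(counts):
    print(wiki_row(pair, n))
```

Keeping the counter keyed by tuples rather than by 'en->fa' strings avoids having to parse the label again when filtering for pairs that differ; the label and the comma-separated count are only produced at output time.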