The "ComparisonTikaAndPDFToText201811" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/ComparisonTikaAndPDFToText201811?action=diff&rev1=6&rev2=7

  ||zh-cn|| 16,038,691 || 18,231,035 ||13.7%||
  ||zh-tw|| 581,428 || 646,268 ||11.2%||

+ When we require that both extracts for a given file have the same language id, we see some different patterns [SQL4].
+
+ ||Language Id|| pdftotext || Tika/PDFBox ||% Change||
+ ||af|| 10,274 || 10,330 ||0.5%||
+ ||an|| 68,416 || 63,324 ||-7.4%||
+ ||ar|| 2,021,212 || 1,652,127 ||-18.3%||
+ ||ast|| 7,522 || 7,581 ||0.8%||
+ ||be|| 33 || 34 ||3.0%||
+ ||bg|| 20,522 || 20,549 ||0.1%||
+ ||bn|| 4,556 || 4,754 ||4.3%||
+ ||br|| 6,836 || 6,977 ||2.1%||
+ ||ca|| 242,738 || 246,268 ||1.5%||
+ ||cs|| 21,997 || 22,338 ||1.6%||
+ ||cy|| 51,631 || 51,332 ||-0.6%||
+ ||da|| 22,737 || 23,412 ||3.0%||
+ ||de|| 29,463,288 || 29,814,230 ||1.2%||
+ ||el|| 9,427,793 || 9,371,790 ||-0.6%||
+ ||en|| 243,470,025 || 244,529,021 ||0.4%||
+ ||es|| 38,873,448 || 38,832,814 ||-0.1%||
+ ||et|| 16,080 || 16,516 ||2.7%||
+ ||eu|| 11,018 || 11,127 ||1.0%||
+ ||fa|| 19,278,448 || 17,596,803 ||-8.7%||
+ ||fi|| 9,767 || 9,913 ||1.5%||
+ ||fr|| 61,594,316 || 62,041,092 ||0.7%||
+ ||ga|| 22,256 || 22,131 ||-0.6%||
+ ||gl|| 174,194 || 166,007 ||-4.7%||
+ ||gu|| 3,456 || 3,633 ||5.1%||
+ ||he|| 3,378,634 || 3,206,568 ||-5.1%||
+ ||hi|| 265,152 || 242,522 ||-8.5%||
+ ||hr|| 30,124 || 30,354 ||0.8%||
+ ||ht|| 2,798 || 2,869 ||2.5%||
+ ||hu|| 13,260 || 13,595 ||2.5%||
+ ||id|| 190,295 || 190,072 ||-0.1%||
+ ||is|| 10,718 || 10,777 ||0.6%||
+ ||it|| 45,674,555 || 45,574,702 ||-0.2%||
+ ||ja|| 27,391,796 || 28,571,483 ||4.3%||
+ ||km|| 5 || 9 ||80.0%||
+ ||kn|| 3,938 || 3,950 ||0.3%||
+ ||ko|| 4,135,028 || 4,266,709 ||3.2%||
+ ||lt|| 5,587 || 5,684 ||1.7%||
+ ||lv|| 10,786 || 10,435 ||-3.3%||
+ ||mk|| 545 || 1,398 ||156.5%||
+ ||ml|| 1,281 || 1,280 ||-0.1%||
+ ||mr|| 22,695 || 22,523 ||-0.8%||
+ ||ms|| 221,191 || 226,910 ||2.6%||
+ ||mt|| 18,241 || 18,768 ||2.9%||
+ ||ne|| 73 || 83 ||13.7%||
+ ||nl|| 548,128 || 552,215 ||0.7%||
+ ||no|| 40,138 || 41,025 ||2.2%||
+ ||oc|| 605 || 609 ||0.7%||
+ ||pa|| 79 || 107 ||35.4%||
+ ||pl|| 50,848 || 51,776 ||1.8%||
+ ||pt|| 2,090,189 || 2,144,561 ||2.6%||
+ ||ro|| 30,272 || 30,889 ||2.0%||
+ ||ru|| 79,195,271 || 78,271,782 ||-1.2%||
+ ||sk|| 8,745 || 6,776 ||-22.5%||
+ ||sl|| 8,515 || 8,760 ||2.9%||
+ ||so|| 224,340 || 212,438 ||-5.3%||
+ ||sq|| 2,882 || 4,269 ||48.1%||
+ ||sr|| 689 || 703 ||2.0%||
+ ||sv|| 40,347 || 41,313 ||2.4%||
+ ||sw|| 877 || 869 ||-0.9%||
+ ||ta|| 1,308 || 1,303 ||-0.4%||
+ ||te|| 3,360 || 3,407 ||1.4%||
+ ||th|| 5,292 || 5,323 ||0.6%||
+ ||tl|| 1,021 || 1,053 ||3.1%||
+ ||tr|| 865,471 || 878,289 ||1.5%||
+ ||uk|| 3,898 || 5,153 ||32.2%||
+ ||ur|| 21,459 || 5,553 ||-74.1%||
+ ||vi|| 2,243,963 || 2,254,112 ||0.5%||
+ ||yi|| 28 || 32 ||14.3%||
+ ||zh-cn|| 15,768,254 || 16,557,238 ||5.0%||
+ ||zh-tw|| 271,648 || 273,762 ||0.8%||
+
+ Further evaluation and analysis are required, but we should look into:
+
+  1. Why are there so many "common words" for ''bn'' in the first common tokens by language table?
+  2. Are there systematic areas for improvement in PDFBox for ''hi'' (-8.5%) and the Arabic-script languages: ''ar'' (-18.3%), ''fa'' (-8.7%), ''ur'' (-74.1%)?
+  3. Are there systematic areas for improvement in pdftotext for the CJK languages: ''ja'' (4.3%), ''ko'' (3.2%), ''zh-cn'' (5.0%), ''zh-tw'' (0.8%)?
+
  = Overall improvements to this process =

   * The wrapper around pdftotext should have "caught" the exception written to stderr and stored it, as we do with exceptions from Tika.
   * Tika currently includes the file's 'title' metadata in the content of the file. This gives the misleading impression that some content was extracted from the file when, in fact, only the title was extracted from the XMP or metadata. Next time, we should use a content handler that only includes the extracted text.
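The first improvement above (capturing what pdftotext writes to stderr) could be sketched roughly as follows. This is only an illustration, not the wrapper used for this comparison: `ExtractResult`, `run_extractor`, and `pdftotext_cmd` are hypothetical names, and the sketch assumes that `pdftotext` is on the PATH and that passing `-` as the output file sends the text to stdout.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class ExtractResult:
    """Hypothetical record of one extraction attempt."""
    text: str               # whatever the tool wrote to stdout
    exit_code: int          # -1 if the tool could not be run or timed out
    stderr: Optional[str]   # None when nothing was written to stderr


def run_extractor(cmd: Sequence[str], timeout_s: float = 120.0) -> ExtractResult:
    """Run an external extractor, keeping stderr instead of discarding it."""
    try:
        proc = subprocess.run(list(cmd), capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return ExtractResult("", -1, "timeout after %s seconds" % timeout_s)
    except OSError as e:
        # For example, the executable is missing from the PATH.
        return ExtractResult("", -1, str(e))
    err = proc.stderr.decode("utf-8", errors="replace").strip()
    return ExtractResult(
        text=proc.stdout.decode("utf-8", errors="replace"),
        exit_code=proc.returncode,
        stderr=err or None,
    )


def pdftotext_cmd(pdf_path: str) -> List[str]:
    """Command line assumed here: UTF-8 output, text written to stdout."""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


if __name__ == "__main__" and len(sys.argv) > 1:
    result = run_extractor(pdftotext_cmd(sys.argv[1]))
    print(result.exit_code, result.stderr)
```

Storing `stderr` next to the extracted text would let a pdftotext failure be reported in the same way as an exception from Tika, rather than appearing only as an empty extract.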
@@ -186, +266 @@

  order by lang_id_1
  }}}

+ [SQL4]
+ {{{
+ select ca.lang_id_1, sum(ca.num_common_tokens)
+ from contents_a ca
+ join contents_b cb on ca.id=cb.id
+ where ca.lang_id_1=cb.lang_id_1
+ group by ca.lang_id_1
+ order by ca.lang_id_1
+ }}}
+
  = How to make sense of the tika-eval reports =
  Exceptions aside, the critical file is ''content/content_diffs_with_exceptions.xlsx''. This shows differences in the content that was extracted. Column ''TOP_10_UNIQUE_TOKEN_DIFFS_A'' records the top 10 most frequent tokens that appear only in "A" extracts (pdftotext); ''TOP_10_UNIQUE_TOKEN_DIFFS_B'' records the top 10 most frequent tokens that appear only in "B" extracts (Tika/PDFBox); ''NUM_COMMON_TOKENS_DIFF_IN_B'' records whether there has been an increase (positive number) or a decrease (negative number) in "common tokens" if one were to move from "A" to "B" as the extraction tool.
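To make the [SQL4] comparison concrete, here is a minimal sketch of how per-language common-token totals for both tools, and the "% Change" column, could be computed in one pass. It is not the query used for the page: it adds a `sum(cb.num_common_tokens)` column so that both sides come out of a single join, uses an in-memory SQLite database as a stand-in for the tika-eval database, and fills it with made-up rows. The table and column names (`contents_a`, `contents_b`, `id`, `lang_id_1`, `num_common_tokens`) are taken from [SQL4]; everything else is an assumption.

```python
import sqlite3

# Variant of [SQL4] that sums common tokens for both extracts (assumption:
# the B-side total is wanted from the same join).
QUERY = """
select ca.lang_id_1,
       sum(ca.num_common_tokens) as tokens_a,
       sum(cb.num_common_tokens) as tokens_b
from contents_a ca
join contents_b cb on ca.id = cb.id
where ca.lang_id_1 = cb.lang_id_1
group by ca.lang_id_1
order by ca.lang_id_1
"""


def pct_change(a, b):
    """Percent change moving from A (pdftotext) to B (Tika/PDFBox)."""
    if a == 0:
        return None  # undefined when A has no common tokens
    return round(100.0 * (b - a) / a, 1)


def common_tokens_by_language(conn):
    """Return (lang, tokens_a, tokens_b, pct_change) rows."""
    return [(lang, a, b, pct_change(a, b)) for lang, a, b in conn.execute(QUERY)]


def build_toy_db():
    """In-memory database with made-up rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    for table in ("contents_a", "contents_b"):
        conn.execute(
            "create table %s (id integer, lang_id_1 text, num_common_tokens integer)"
            % table
        )
    conn.executemany(
        "insert into contents_a values (?, ?, ?)",
        [(1, "en", 100), (2, "en", 50), (3, "ur", 40), (4, "fa", 10)],
    )
    conn.executemany(
        "insert into contents_b values (?, ?, ?)",
        # File 4 gets a different language id in B, so the join drops it.
        [(1, "en", 110), (2, "en", 55), (3, "ur", 10), (4, "ar", 30)],
    )
    return conn


if __name__ == "__main__":
    for row in common_tokens_by_language(build_toy_db()):
        print(row)
```

A positive percentage corresponds to a positive ''NUM_COMMON_TOKENS_DIFF_IN_B'' summed over the files for that language, that is, more common tokens when moving from "A" to "B".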