[Tika Wiki] Update of "ComparisonTikaAndPDFToText201811" by TimothyAllison

Apache Wiki Mon, 26 Nov 2018 15:02:01 -0800


The "ComparisonTikaAndPDFToText201811" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/ComparisonTikaAndPDFToText201811?action=diff&rev1=6&rev2=7

  ||zh-cn|| 16,038,691 || 18,231,035 ||13.7%||
  ||zh-tw|| 581,428 || 646,268 ||11.2%||
  
+ When we restrict the comparison to files for which both extracts have the 
same language id, the per-language common-token counts show a different 
pattern [SQL4].
+ 
+ ||Language Id|| pdftotext || Tika/PDFBox ||% Change||
+ ||af|| 10,274 || 10,330 ||0.5%||
+ ||an|| 68,416 || 63,324 ||-7.4%||
+ ||ar|| 2,021,212 || 1,652,127 ||-18.3%||
+ ||ast|| 7,522 || 7,581 ||0.8%||
+ ||be|| 33 || 34 ||3.0%||
+ ||bg|| 20,522 || 20,549 ||0.1%||
+ ||bn|| 4,556 || 4,754 ||4.3%||
+ ||br|| 6,836 || 6,977 ||2.1%||
+ ||ca|| 242,738 || 246,268 ||1.5%||
+ ||cs|| 21,997 || 22,338 ||1.6%||
+ ||cy|| 51,631 || 51,332 ||-0.6%||
+ ||da|| 22,737 || 23,412 ||3.0%||
+ ||de|| 29,463,288 || 29,814,230 ||1.2%||
+ ||el|| 9,427,793 || 9,371,790 ||-0.6%||
+ ||en|| 243,470,025 || 244,529,021 ||0.4%||
+ ||es|| 38,873,448 || 38,832,814 ||-0.1%||
+ ||et|| 16,080 || 16,516 ||2.7%||
+ ||eu|| 11,018 || 11,127 ||1.0%||
+ ||fa|| 19,278,448 || 17,596,803 ||-8.7%||
+ ||fi|| 9,767 || 9,913 ||1.5%||
+ ||fr|| 61,594,316 || 62,041,092 ||0.7%||
+ ||ga|| 22,256 || 22,131 ||-0.6%||
+ ||gl|| 174,194 || 166,007 ||-4.7%||
+ ||gu|| 3,456 || 3,633 ||5.1%||
+ ||he|| 3,378,634 || 3,206,568 ||-5.1%||
+ ||hi|| 265,152 || 242,522 ||-8.5%||
+ ||hr|| 30,124 || 30,354 ||0.8%||
+ ||ht|| 2,798 || 2,869 ||2.5%||
+ ||hu|| 13,260 || 13,595 ||2.5%||
+ ||id|| 190,295 || 190,072 ||-0.1%||
+ ||is|| 10,718 || 10,777 ||0.6%||
+ ||it|| 45,674,555 || 45,574,702 ||-0.2%||
+ ||ja|| 27,391,796 || 28,571,483 ||4.3%||
+ ||km|| 5 || 9 ||80.0%||
+ ||kn|| 3,938 || 3,950 ||0.3%||
+ ||ko|| 4,135,028 || 4,266,709 ||3.2%||
+ ||lt|| 5,587 || 5,684 ||1.7%||
+ ||lv|| 10,786 || 10,435 ||-3.3%||
+ ||mk|| 545 || 1,398 ||156.5%||
+ ||ml|| 1,281 || 1,280 ||-0.1%||
+ ||mr|| 22,695 || 22,523 ||-0.8%||
+ ||ms|| 221,191 || 226,910 ||2.6%||
+ ||mt|| 18,241 || 18,768 ||2.9%||
+ ||ne|| 73 || 83 ||13.7%||
+ ||nl|| 548,128 || 552,215 ||0.7%||
+ ||no|| 40,138 || 41,025 ||2.2%||
+ ||oc|| 605 || 609 ||0.7%||
+ ||pa|| 79 || 107 ||35.4%||
+ ||pl|| 50,848 || 51,776 ||1.8%||
+ ||pt|| 2,090,189 || 2,144,561 ||2.6%||
+ ||ro|| 30,272 || 30,889 ||2.0%||
+ ||ru|| 79,195,271 || 78,271,782 ||-1.2%||
+ ||sk|| 8,745 || 6,776 ||-22.5%||
+ ||sl|| 8,515 || 8,760 ||2.9%||
+ ||so|| 224,340 || 212,438 ||-5.3%||
+ ||sq|| 2,882 || 4,269 ||48.1%||
+ ||sr|| 689 || 703 ||2.0%||
+ ||sv|| 40,347 || 41,313 ||2.4%||
+ ||sw|| 877 || 869 ||-0.9%||
+ ||ta|| 1,308 || 1,303 ||-0.4%||
+ ||te|| 3,360 || 3,407 ||1.4%||
+ ||th|| 5,292 || 5,323 ||0.6%||
+ ||tl|| 1,021 || 1,053 ||3.1%||
+ ||tr|| 865,471 || 878,289 ||1.5%||
+ ||uk|| 3,898 || 5,153 ||32.2%||
+ ||ur|| 21,459 || 5,553 ||-74.1%||
+ ||vi|| 2,243,963 || 2,254,112 ||0.5%||
+ ||yi|| 28 || 32 ||14.3%||
+ ||zh-cn|| 15,768,254 || 16,557,238 ||5.0%||
+ ||zh-tw|| 271,648 || 273,762 ||0.8%||
+ 
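The "% Change" column is consistent with the relative change from the pdftotext count to the Tika/PDFBox count, rounded to one decimal place. The following is a minimal sketch of that arithmetic, assuming this is how the column was derived; the function name `pct_change` is ours and the rows are copied from the table above.

```python
def pct_change(a: int, b: int) -> float:
    """Percent change in common tokens when moving from extract A to extract B."""
    if a == 0:
        raise ValueError("percent change is undefined when the baseline is 0")
    return round((b - a) / a * 100, 1)


# A few rows copied from the table above: (pdftotext, Tika/PDFBox)
rows = {
    "ar": (2_021_212, 1_652_127),
    "ja": (27_391_796, 28_571_483),
    "ur": (21_459, 5_553),
    "km": (5, 9),
}

for lang, (a, b) in rows.items():
    print(f"{lang}: {pct_change(a, b):+.1f}%")
```

Rows with very small counts, such as ''km'' (5 vs. 9), produce large percentages from a handful of tokens, so they should be read with caution.
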
+ Further evaluation and analysis are required, but we should look into:
+ 
+  1. Why are there so many "common words" for ''bn'' in the first common 
tokens by language table?
+  2. Are there systematic areas for improvement in PDFBox for ''hi'' (-8.5%) 
and the Arabic-script languages: ''ar'' (-18.3%), ''fa'' (-8.7%), ''ur'' (-74.1%)?
+  3. Are there systematic areas for improvement in pdftotext for the CJK 
languages: ''ja'' (4.3%), ''ko'' (3.2%), ''zh-cn'' (5.0%), ''zh-tw'' (0.8%)?
+ 
  = Overall improvements to this process =
   * The wrapper around pdftotext should have captured the exceptions that 
pdftotext wrote to stderr and stored them, as we do with exceptions from Tika.
   * Tika currently includes the file's 'title' metadata in the content of the 
file.  This gives the misleading impression that some content was extracted 
from the file when, in fact, only the title was extracted from the XMP or 
metadata.  Next time, we should use a content handler that only includes the 
extracted text.
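
One way to capture stderr from a command-line extractor is sketched below. This is an illustration, not the wrapper used for this evaluation: `ExtractResult`, `run_extractor` and `pdftotext_cmd` are hypothetical names, and the demo runs a stand-in child process so the sketch works without pdftotext installed.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExtractResult:
    text: str                   # what the tool wrote to stdout
    stderr: str                 # warnings/errors the tool wrote to stderr
    returncode: Optional[int]   # None if the process timed out


def run_extractor(cmd: List[str], timeout_s: float = 60.0) -> ExtractResult:
    """Run an extractor and keep stdout, stderr and the exit code together."""
    try:
        proc = subprocess.run(
            cmd,
            capture_output=True,   # collect both stdout and stderr
            encoding="utf-8",
            errors="replace",      # do not fail on undecodable bytes
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return ExtractResult("", f"timeout after {timeout_s}s", None)
    return ExtractResult(proc.stdout, proc.stderr, proc.returncode)


def pdftotext_cmd(pdf_path: str) -> List[str]:
    # "-" as the output file asks pdftotext to write the text to stdout
    return ["pdftotext", pdf_path, "-"]


# Stand-in for pdftotext: a child process that writes to both streams and fails.
fake = [sys.executable, "-c",
        "import sys; sys.stdout.write('partial text'); "
        "sys.stderr.write('Syntax Error: bad xref'); sys.exit(1)"]
result = run_extractor(fake)
print(result.returncode, repr(result.stderr))
```

With pdftotext on the path, `run_extractor(pdftotext_cmd("some.pdf"))` would return the extracted text and anything written to stderr, which could then be stored alongside the Tika exceptions.
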
@@ -186, +266 @@

  order by lang_id_1
  }}}
  
+ [SQL4]
+ {{{
+ select ca.lang_id_1, sum(ca.num_common_tokens), sum(cb.num_common_tokens)
+ from contents_a ca
+ join contents_b cb on ca.id=cb.id
+ where ca.lang_id_1=cb.lang_id_1
+ group by ca.lang_id_1
+ order by ca.lang_id_1
+ }}}
+ 
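To see what the same-language filter in [SQL4] does, the query can be tried on toy data. The sketch below uses an in-memory SQLite database with a reduced, assumed schema (only the three columns the query touches) and made-up rows; it selects the totals from both tables so the two sides can be compared. It is not the database or data used for the evaluation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table contents_a (id integer primary key, lang_id_1 text, num_common_tokens integer);
create table contents_b (id integer primary key, lang_id_1 text, num_common_tokens integer);
""")

# Toy rows: file 2 gets different language ids from the two extracts.
conn.executemany("insert into contents_a values (?, ?, ?)",
                 [(1, "en", 100), (2, "ur", 50), (3, "ar", 40)])
conn.executemany("insert into contents_b values (?, ?, ?)",
                 [(1, "en", 110), (2, "fa", 20), (3, "ar", 30)])

rows = conn.execute("""
select ca.lang_id_1, sum(ca.num_common_tokens), sum(cb.num_common_tokens)
from contents_a ca
join contents_b cb on ca.id = cb.id
where ca.lang_id_1 = cb.lang_id_1
group by ca.lang_id_1
order by ca.lang_id_1
""").fetchall()

for lang, total_a, total_b in rows:
    print(lang, total_a, total_b)
```

File 2 drops out of both totals because its two extracts disagree on the language id, which is the effect of the `where` clause.
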
  = How to make sense of the tika-eval reports =
  Exceptions aside, the critical file is 
''content/content_diffs_with_exceptions.xlsx''.  This shows differences in the 
content that was extracted.  Column ''TOP_10_UNIQUE_TOKEN_DIFFS_A'' records the 
top 10 most frequent tokens that appear only in "A" extracts (pdftotext); 
''TOP_10_UNIQUE_TOKEN_DIFFS_B'' records the top 10 most frequent tokens that 
appear only in "B" extracts (Tika/PDFBox); ''NUM_COMMON_TOKENS_DIFF_IN_B'' 
records the change in "common tokens" if one were to move from "A" to "B" as 
the extraction tool: a positive number is an increase, a negative number a 
decrease.
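
The three columns can be illustrated on a toy pair of extracts. This sketch follows the column descriptions above rather than tika-eval's actual implementation: tokens are assumed to be pre-split, the common-token counts are assumed to be given, and the function names are ours.

```python
from collections import Counter
from typing import List, Tuple


def unique_token_diffs(a_tokens: List[str], b_tokens: List[str], n: int = 10
                       ) -> Tuple[List[Tuple[str, int]], List[Tuple[str, int]]]:
    """Top-n most frequent tokens that appear only in A, and only in B."""
    a_counts, b_counts = Counter(a_tokens), Counter(b_tokens)
    only_a = Counter({t: c for t, c in a_counts.items() if t not in b_counts})
    only_b = Counter({t: c for t, c in b_counts.items() if t not in a_counts})
    return only_a.most_common(n), only_b.most_common(n)


def num_common_tokens_diff_in_b(common_a: int, common_b: int) -> int:
    """Positive when B has more common tokens than A, negative when fewer."""
    return common_b - common_a


# Toy extracts: "A" splits a word that "B" keeps whole.
a = ["the", "report", "offi", "offi", "ce"]
b = ["the", "report", "office", "office"]

top_a, top_b = unique_token_diffs(a, b)
print(top_a)
print(top_b)
print(num_common_tokens_diff_in_b(common_a=2, common_b=4))
```

In a real report, fragments such as split words showing up in one tool's unique-token column are a hint about where that tool's extraction differs.
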
