Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.

The "ComparisonTikaAndPDFToText201811" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/ComparisonTikaAndPDFToText201811?action=diff&rev1=17&rev2=18

 1. We should experiment with other language detectors and evaluate them on 
the traditional language-id performance measures: accuracy and speed on 
known-language content.  However, we should also evaluate how well they handle 
various types of degraded text, to confirm that confidence scores track the 
noise -- content that is 98% junk should not receive a language-id confidence 
of 99.999%.  See the first sketch after this list.
 2. We should augment our "common words" lists to cover all languages 
identified by whichever language detector we choose.  We should not back off 
to the English list for "common words".
 3. We should continue to develop a junk metric that is more nuanced than the 
simple sum of "Common Tokens" and the OOV%.  The metric should take the 
following into account:

  a. Amount of evidence.

  b. Alignment of the distribution of token lengths relative to the "id'd" 
language (this will be useless for CJK, which tika-eval simply bigrams, but it 
might be very useful for most other languages); see the similarity sketch 
after this list.

  c. The amount of symbols and U+FFFD characters vs. the alphabetic tokens; 
see the ratio sketch after this list.

  d. Instead of a binary OOV%, it might be useful to calculate alignment with 
a Zipf distribution, or simply similarity to a language model -- we'd need to 
include the % of words in the common-words file.  See the Zipf sketch after 
this list.

  e. Incorrect duplication of text.  For the file 
''commoncrawl3/2E/2EXCWC7T6P5ZY6DINFI3X2UQNIMAISKT'', tika-eval shows an 
increase in Common Tokens of 50,372 tokens when switching from pdftotext to 
PDFBox/Tika.  However, this file has an absurd amount of duplicate text in the 
headers -- 17,000 occurrences of "training" in the PDFBox/Tika extract, and 
only 230 in the pdftotext extract.  PDFBox/Tika correctly suppresses these 
duplicate text portions if ''setSuppressDuplicateOverlappingText'' is set to 
''true'', but Tika's default is not to suppress duplicate text; see the 
configuration sketch after this list.  One consideration is that, for this 
file, the OOV% is 39% in pdftotext but only 8% in the text extracted by 
PDFBox/Tika.  This suggests that, instead of simply summing the common tokens, 
it might be better to sum them only in files whose OOV% is within the norm 
(say, one stddev).  As a side note, 40% OOV is fairly common for English 
documents -- the median is 45%, and the stddev is 14%.
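
To make the calibration check in item 1 concrete, here is a minimal sketch 
that runs Tika's ''OptimaizeLangDetector'' over progressively noisier copies 
of the same sentence and prints the raw score.  The ''injectJunk'' helper is 
hypothetical, written only for illustration; if the detector is well 
calibrated, the score should fall as the junk ratio rises.

{{{
import java.io.IOException;
import java.util.Random;

import org.apache.tika.langdetect.OptimaizeLangDetector;
import org.apache.tika.language.detect.LanguageDetector;
import org.apache.tika.language.detect.LanguageResult;

public class NoisyLangIdCheck {

    // Hypothetical helper: replace roughly junkRatio of the non-whitespace
    // characters with random ASCII symbols to simulate degraded extraction.
    static String injectJunk(String text, double junkRatio, Random rng) {
        StringBuilder sb = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            if (!Character.isWhitespace(c) && rng.nextDouble() < junkRatio) {
                sb.append((char) ('!' + rng.nextInt(14)));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        LanguageDetector detector = new OptimaizeLangDetector().loadModels();
        String clean = "The quick brown fox jumps over the lazy dog and keeps "
                + "running through the quiet fields until the early evening.";
        Random rng = new Random(42);
        for (double junk : new double[] {0.0, 0.5, 0.98}) {
            LanguageResult result = detector.detect(injectJunk(clean, junk, rng));
            // A well-calibrated detector should not report near-certain
            // confidence on text that is 98% junk.
            System.out.printf("junk=%.2f -> lang=%s rawScore=%.4f certain=%b%n",
                    junk, result.getLanguage(), result.getRawScore(),
                    result.isReasonablyCertain());
        }
    }
}
}}}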
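For 3(b), one plausible way to score token-length alignment is to compare the 
observed token-length histogram against a reference histogram for the id'd 
language, with the reference computed offline from known-clean text in that 
language.  Cosine similarity is our choice for the sketch, not anything 
tika-eval currently computes:

{{{
public class TokenLengthAlignment {

    // Normalized histogram of token lengths, capped at 20 characters.
    static double[] lengthHistogram(String[] tokens) {
        double[] hist = new double[21];
        for (String t : tokens) {
            hist[Math.min(t.length(), 20)]++;
        }
        for (int i = 0; i < hist.length; i++) {
            hist[i] /= tokens.length;
        }
        return hist;
    }

    // Cosine similarity between the observed histogram and a reference
    // profile for the id'd language; 1.0 means perfect alignment.  The
    // reference profile would be computed offline from known-clean text.
    static double cosine(double[] observed, double[] reference) {
        double dot = 0, no = 0, nr = 0;
        for (int i = 0; i < observed.length; i++) {
            dot += observed[i] * reference[i];
            no += observed[i] * observed[i];
            nr += reference[i] * reference[i];
        }
        return dot / (Math.sqrt(no) * Math.sqrt(nr));
    }
}
}}}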
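For 3(c), a simple character-level approximation: count U+FFFD and other 
non-letter, non-digit, non-whitespace symbols against alphabetic characters.  
The item speaks of alphabetic tokens; counting characters instead is a 
simplification for the sketch, and ordinary punctuation lands in the symbol 
count, so any threshold would need tuning:

{{{
public class JunkCharRatio {

    // Ratio of U+FFFD plus other non-letter, non-digit, non-whitespace
    // code points to alphabetic code points.  Ordinary punctuation counts
    // as a symbol here, so clean prose still yields a small nonzero ratio.
    static double symbolToAlphaRatio(String text) {
        long symbols = 0, alpha = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp == 0xFFFD
                    || (!Character.isLetterOrDigit(cp)
                        && !Character.isWhitespace(cp))) {
                symbols++;
            } else if (Character.isLetter(cp)) {
                alpha++;
            }
            i += Character.charCount(cp);
        }
        return alpha == 0 ? Double.POSITIVE_INFINITY : (double) symbols / alpha;
    }
}
}}}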
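For 3(d), one candidate measure of Zipf alignment is the least-squares slope 
of log(frequency) against log(rank): natural text tends toward a slope near 
-1, and junk tends to drift away from it.  The log-log fit is our 
illustration, not an existing tika-eval metric; the common-words fraction 
mentioned in the item is sketched alongside:

{{{
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class ZipfAlignment {

    // Least-squares slope of log(frequency) against log(rank); natural
    // text tends toward a slope near -1, junk drifts away from it.
    static double zipfSlope(String[] tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        if (counts.size() < 2) {
            return Double.NaN;  // not enough distinct tokens to fit a slope
        }
        int[] freqs = counts.values().stream()
                .sorted((a, b) -> b - a)
                .mapToInt(Integer::intValue).toArray();
        double n = freqs.length, sx = 0, sy = 0, sxy = 0, sxx = 0;
        for (int rank = 1; rank <= freqs.length; rank++) {
            double x = Math.log(rank), y = Math.log(freqs[rank - 1]);
            sx += x; sy += y; sxy += x * y; sxx += x * x;
        }
        return (n * sxy - sx * sy) / (n * sxx - sx * sx);
    }

    // Fraction of tokens found in the per-language common-words list.
    static double commonWordFraction(String[] tokens, Set<String> commonWords) {
        long hits = Arrays.stream(tokens).filter(commonWords::contains).count();
        return (double) hits / tokens.length;
    }
}
}}}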
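For 3(e), the ''setSuppressDuplicateOverlappingText'' setting is exposed 
through Tika's ''PDFParserConfig''.  A minimal sketch of turning it on for a 
single parse (the command-line file path and the unlimited write limit are 
illustrative choices):

{{{
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

public class SuppressDuplicates {
    public static void main(String[] args) throws Exception {
        // Tika's default leaves duplicate overlapping text in place;
        // turning this on delegates to PDFBox's duplicate suppression.
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setSuppressDuplicateOverlappingText(true);

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);

        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1);  // no write limit
        try (InputStream is = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(is, handler, new Metadata(), context);
        }
        System.out.println(handler.toString());
    }
}
}}}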
  
== 2. Are there systematic areas for improvement in PDFBox for ''hi'' (-8.5%), 
''he'' (-5.1%), and the Arabic-script languages ''ar'' (-18%), ''fa'' (-8%), 
and ''ur'' (-74%)? ==
  
