Dear Wiki user, You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "ComparisonTikaAndPDFToText201811" page has been changed by TimothyAllison: https://wiki.apache.org/tika/ComparisonTikaAndPDFToText201811?action=diff&rev1=15&rev2=16

Most importantly, we need to determine if any of the above areas for inquiry are based on faults in tika-eval that should be fixed.

= Follow-up Analysis =

== 1. Why are there so many "common words" for ''bn'' in the first common tokens by language table? ==

I ran [SQL5] and manually reviewed the results. I observed the following:

 1. Only 1 of the 100 documents had what looked like Bangla words among its top 10 most common words.
 2. My intuition from previous experience with Optimaize -- confirmed by looking at the top 10 words for these documents -- is that Optimaize prefers ''bn'' when there are many numerals and very little other language content.
 3. As in previous work with Optimaize, I was struck by the confidence levels, which are typically very high (~0.999) even when there is very little content. For example, ''commoncrawl3/7A/7AZUB5NHLJN3TBCMEP2YRSRK6DDNBP5F'' is mostly comprised of the UTF-8 replacement character "EF BF BD" (equivalent to U+FFFD) -- roughly 13,000 of these; there are a few newlines, a few tabs, a few numerals, and the word ''untitled'', and yet Optimaize's confidence is ''0.9999907612800598'' that this is Bangla.
 4. The current OOV% metric does not calculate a confidence. If there is just one alphanumeric term and it happens to be in the dictionary, then the OOV% is 0%, which is less than entirely useful. It would be better to improve our "language-y" score or its inverse, the "junk" score (see [[https://issues.apache.org/jira/browse/TIKA-1443|TIKA-1443]]), to include a confidence interval based on the amount of input.
 5. When tika-eval doesn't have a "common words" list for a language, e.g., ''bn'', it backs off and uses the English list. Given that the internet is overwhelmingly English, that the ''commoncrawl3'' regression corpus contains quite a bit of English, and that content from the title metadata field slipped into the extracted text for the PDFBox/Tika extracts, this backing off to English can lead to misleading results.

My conclusion is that most of the documents that received a language id of ''bn'' actually contain a high percentage of junk.

Recommendations:

 1. We should experiment with other language detectors and evaluate them on the traditional language-id performance measures: accuracy and speed on known-language content. However, we should also evaluate how well they handle various types of degraded text, to confirm that the confidence scores track the noise -- content that is 98% junk should not receive a language-id confidence of 99.999%.
 2. We should augment our "common words" lists to cover all languages identified by whichever language detector we choose. We should not back off to the English list for "common words".
 3. We should continue to develop a junk metric that is more nuanced than the simple sum of "Common Tokens" and the OOV%. The metric should take the following into account:
  a. Amount of evidence.
  b. Alignment of the distribution of token lengths with that of the identified language (this will be useless for CJK, which tika-eval simply bigrams, but it might be very useful for most other languages).
  c. Amount of symbols and U+FFFD characters vs. alphabetic tokens.
  d. Instead of a binary OOV%, it might be useful to calculate alignment with a Zipf distribution, or simply similarity to a language model -- we'd need to include the % of words in the common words file.
  e. Incorrect duplication of text. For file ''commoncrawl3/2E/2EXCWC7T6P5ZY6DINFI3X2UQNIMAISKT'', tika-eval shows an increase of 50,372 Common Tokens when switching from pdftotext to PDFBox/Tika. However, this file has an absurd amount of duplicate text in the headers -- 17,000 occurrences of "training" in the PDFBox/Tika extract but only 230 in the pdftotext extract. PDFBox/Tika correctly suppresses these duplicate text portions if ''setSuppressDuplicateOverlappingText'' is set to ''true'', but Tika's default is not to suppress duplicate text. One consideration: for this file, the OOV% is 39% in the pdftotext extract but only 8% in the text extracted by PDFBox/Tika. This suggests that, instead of simply summing the common tokens, it might be better to sum them only for files whose OOV% is within the norm (say, one stddev). As a side note, 40% OOV is fairly common for English documents -- the median is 45%, and the stddev is 14%.

== 2. Are there systematic areas for improvement in PDFBox for ''hi'' (-8.5%), ''he'' (-5.1%), and the Arabic-script languages ''ar'' (-18%), ''fa'' (-8%), and ''ur'' (-74%)? ==

I don't know these languages, but I ran [SQL7] and then put the contents of ''TOP_10_UNIQUE_TOKEN_DIFFS_A'' and ''TOP_10_UNIQUE_TOKEN_DIFFS_B'' through Google Translate.
For example, pdftotext's top 10 unique words in ''commoncrawl3_refetched/XH/XHYIWIBT5QPY64UYUPLXZXAYC2I5JPZS'':
{{{
ميں: 532 | ہے: 520 | كے: 450 | ہيں: 370 | كہ: 365 | كو: 343 | سے: 342 | كا: 297 | ہم: 280 | جناب: 254
}}}
are translated as:
{{{
I: 532 | Is: 520 | Of: 450 | Are: 370 | Yes: 365 | Who: 343 | From: 342 | : 297 | We: 280 | Mr.: 254
}}}

Whereas PDFBox/Tika's unique tokens
{{{
ںيم: 564 | ےہ: 537 | ےك: 468 | ںيہ: 386 | ہك: 365 | وك: 360 | ےس: 348 | اك: 306 | مہ: 281 | انجب: 250
}}}
are translated as:
{{{
Th: 564 | Yes: 537 | S: 468 | Yes: 386 | Hak: 365 | Ki: 360 | S: 348 | A: 306 | Mah: 281 | Ingredients: 250
}}}

Overall, this method wasn't able to yield satisfactory insight into general patterns. In some cases the individual terms looked better in one tool, and in other cases ''vice versa''.

I did note that there were more cases in PDFBox's extracted text of numerals concatenated with words, as in ''commoncrawl3/JG/JGE6WTYI5SEI3Z4JUULIPSSRTNL3VMIG'':

''TOP_10_UNIQUE_TOKEN_DIFFS_A''
{{{
1: 167 | رياضي: 167 | 9: 44 | 8: 38 | 7: 28 | 6: 16 | 5: 9 | 4: 6 | 3: 2 | 9622243
}}}

''TOP_10_UNIQUE_TOKEN_DIFFS_B''
{{{
رياضي: 44 | 8رياضي: 38 | 7رياضي: 28 | 10رياضي: 24 | 6رياضي: 16 | 5رياضي: 9 | 4رياضي: 6 | 3رياضي: 2 | 96222431: 1
}}}

= Post-Study Reflection/Areas for Improvements =

@@ -317, +355 @@
limit 100;
}}}

[SQL6]
{{{
select ca.id, file_path,
1-(cast(ca.num_common_tokens as float) / cast(ca.num_alphabetic_tokens as float)) as OOV_A,
ca.num_alphabetic_tokens,
1-(cast(cb.num_common_tokens as float) / cast(cb.num_alphabetic_tokens as float)) as OOV_B,
cb.num_alphabetic_tokens,
ca.lang_id_1, ca.lang_id_prob_1,
cb.lang_id_1, cb.lang_id_prob_1,
ca.top_n_tokens, cb.top_n_tokens
from contents_b cb
join contents_a ca on cb.id=ca.id
join profiles_a pa on ca.id=pa.id
join containers c on pa.container_id=c.container_id
where cb.lang_id_1 = 'bn'
and ca.num_alphabetic_tokens > 0
and cb.num_alphabetic_tokens > 0
order by OOV_B asc
limit 100;
}}}

[SQL7]
{{{
select file_path, ca.top_n_tokens, cb.top_n_tokens,
(cb.num_common_tokens-ca.num_common_tokens) as delta_common_tokens,
top_10_unique_token_diffs_a, top_10_unique_token_diffs_b
from contents_a ca
join contents_b cb on ca.id=cb.id
join content_comparisons cc on cc.id=ca.id
join profiles_a pa on ca.id=pa.id
join containers c on pa.container_id=c.container_id
where ca.lang_id_1='ur'
and cb.lang_id_1='ur'
order by delta_common_tokens asc;
}}}

= How to make sense of the tika-eval reports =

Exceptions aside, the critical file is ''content/content_diffs_with_exceptions.xlsx''. It shows differences in the content that was extracted. Column ''TOP_10_UNIQUE_TOKEN_DIFFS_A'' records the 10 most frequent tokens that appear only in the "A" extracts (pdftotext); ''TOP_10_UNIQUE_TOKEN_DIFFS_B'' records the 10 most frequent tokens that appear only in the "B" extracts (Tika/PDFBox); ''NUM_COMMON_TOKENS_DIFF_IN_B'' records whether there was an increase (positive number) or a decrease in "common tokens" when moving from "A" to "B" as the extraction tool.
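The arithmetic behind these columns can be sketched as follows. This is a minimal, illustrative Python sketch -- not tika-eval's actual implementation -- whose function names are hypothetical; the OOV% formula mirrors the one computed in [SQL6], and the unique-token diffs mirror the ''TOP_10_UNIQUE_TOKEN_DIFFS_A''/''_B'' columns.

```python
from collections import Counter

def oov_percent(tokens, common_words):
    # OOV% as in [SQL6]: 1 - (num_common_tokens / num_alphabetic_tokens).
    # Returns None when there are no alphabetic tokens, rather than a
    # misleading 0% built on no evidence (see point 4 above).
    alphabetic = [t for t in tokens if t.isalpha()]
    if not alphabetic:
        return None
    num_common = sum(1 for t in alphabetic if t in common_words)
    return 1.0 - num_common / len(alphabetic)

def top_unique_token_diffs(counts_a, counts_b, n=10):
    # The n most frequent tokens appearing ONLY in one extract.
    only_a = Counter({t: c for t, c in counts_a.items() if t not in counts_b})
    only_b = Counter({t: c for t, c in counts_b.items() if t not in counts_a})
    return only_a.most_common(n), only_b.most_common(n)

# Toy example: extract "A" (pdftotext) vs. extract "B" (PDFBox/Tika)
counts_a = Counter(["the", "training", "training", "42"])
counts_b = Counter(["the", "training", "model", "42"])
print(oov_percent(list(counts_a.elements()), {"the", "training"}))  # 0.0
print(top_unique_token_diffs(counts_a, counts_b))  # ([], [('model', 1)])
```

Note that ''NUM_COMMON_TOKENS_DIFF_IN_B'' is then simply B's common-token count minus A's, which is why the duplicated-header problem in point 3.e above can inflate it.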