Dear Wiki user, You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "ComparisonTikaAndPDFToText201811" page has been changed by TimothyAllison: https://wiki.apache.org/tika/ComparisonTikaAndPDFToText201811?action=diff&rev1=15&rev2=16

Most importantly, we need to determine if any of the above areas for inquiry are based on faults in tika-eval that should be fixed.

= Follow-up Analysis =

== 1. Why are there so many "common words" for ''bn'' in the first common tokens by language table? ==

I ran [SQL5] and manually reviewed the results. I observed the following:

 1. Only 1 of the 100 documents had what looked like Bangla words among its top 10 most common words.
 2. My intuition from previous experience with Optimaize -- confirmed by looking at the top 10 words for these documents -- is that Optimaize prefers ''bn'' when there are many numerals and very little other language content.
 3. As in previous work with Optimaize, I was struck by the confidence levels, which are typically very high (~0.999) even when there is very little content. For example, ''commoncrawl3/7A/7AZUB5NHLJN3TBCMEP2YRSRK6DDNBP5F'' is mostly comprised of the UTF-8 replacement character "EF BF BD" (equivalent to U+FFFD) -- roughly 13,000 of these; there are a few newlines, a few tabs, a few numerals, and the word ''untitled'', and yet Optimaize's confidence is ''0.9999907612800598'' that this is Bangla.
 4. The current OOV% metric does not calculate a confidence. If there is just one alphanumeric term and it happens to be in the dictionary, then the OOV% is 0%, which is less than entirely useful. It would be better to improve our "language-y" score or its inverse, the "junk" score (see [[https://issues.apache.org/jira/browse/TIKA-1443|TIKA-1443]]), to include a confidence interval based on the amount of input.
 5. When tika-eval doesn't have a "common words" list for a language, e.g., ''bn'', it backs off and uses the English list. Given that the internet is overwhelmingly English, that the ''commoncrawl3'' regression corpus contains quite a bit of English, and that content from the title metadata field slipped into the extracted text for the PDFBox/Tika extracts, this backing off to English can lead to misleading results.

My conclusion is that most of the documents that received a language id of ''bn'' actually contain a high percentage of junk.

Recommendations:

 1. We should experiment with other language detectors and evaluate them on the traditional language-id performance measures: accuracy and speed on known-language content. However, we should also evaluate how well they handle various types of degraded text, to confirm that the confidence scores track the noise -- content that is 98% junk should not receive a language-id confidence of 99.999%.
 2. We should augment our "common words" lists to cover all languages identified by whichever language detector we choose. We should not back off to the English list for "common words".
 3. We should continue to develop a junk metric that is more nuanced than the simple sum of "Common Tokens" and the OOV%. The metric should take the following into account:
  a. Amount of evidence.
  b. Alignment of the distribution of token lengths with that of the identified language (this will be useless for CJK, which tika-eval simply bigrams, but it might be very useful for most other languages).
  c. Amount of symbols and U+FFFD characters vs. alphabetic tokens.
  d. Instead of a binary OOV%, it might be useful to calculate alignment with a Zipf distribution, or simply similarity to a language model -- we'd need to include the % of words in the common words file.
  e. Incorrect duplication of text. For file ''commoncrawl3/2E/2EXCWC7T6P5ZY6DINFI3X2UQNIMAISKT'', tika-eval shows an increase of 50,372 Common Tokens when switching from pdftotext to PDFBox/Tika. However, this file has an absurd amount of duplicate text in the headers -- 17,000 occurrences of "training" in the PDFBox/Tika extract but only 230 in the pdftotext extract. PDFBox/Tika correctly suppresses these duplicate text portions if ''setSuppressDuplicateOverlappingText'' is set to ''true'', but Tika's default is not to suppress duplicate text. One consideration: for this file, the OOV% is 39% in the pdftotext extract but only 8% in the text extracted by PDFBox/Tika. This suggests that, instead of simply summing the common tokens, it might be better to sum them only for files whose OOV% is within the norm (say, one stddev). As a side note, 40% OOV is fairly common for English documents -- the median is 45%, and the stddev is 14%.

== 2. Are there systematic areas for improvement in PDFBox for ''hi'' (-8.5%), ''he'' (-5.1%), and the Arabic-script languages ''ar'' (-18%), ''fa'' (-8%), and ''ur'' (-74%)? ==

I don't know these languages, but I ran [SQL7] and then put the contents of ''TOP_10_UNIQUE_TOKEN_DIFFS_A'' and ''TOP_10_UNIQUE_TOKEN_DIFFS_B'' through Google Translate.
For example, pdftotext's top 10 unique words in ''commoncrawl3_refetched/XH/XHYIWIBT5QPY64UYUPLXZXAYC2I5JPZS'':
{{{
ميں: 532 | ہے: 520 | كے: 450 | ہيں: 370 | كہ: 365 | كو: 343 | سے: 342 | كا: 297 | ہم: 280 | جناب: 254
}}}
are translated as:
{{{
I: 532 | Is: 520 | Of: 450 | Are: 370 | Yes: 365 | Who: 343 | From: 342 | : 297 | We: 280 | Mr.: 254
}}}

Whereas PDFBox/Tika's unique tokens
{{{
ںيم: 564 | ےہ: 537 | ےك: 468 | ںيہ: 386 | ہك: 365 | وك: 360 | ےس: 348 | اك: 306 | مہ: 281 | انجب: 250
}}}
are translated as:
{{{
Th: 564 | Yes: 537 | S: 468 | Yes: 386 | Hak: 365 | Ki: 360 | S: 348 | A: 306 | Mah: 281 | Ingredients: 250
}}}

Overall, this method wasn't able to yield satisfactory insight into general patterns. In some cases the individual terms looked better in one tool, and in other cases ''vice versa''.

I did note that there were more cases in PDFBox's extracted text of numerals concatenated with words, as in ''commoncrawl3/JG/JGE6WTYI5SEI3Z4JUULIPSSRTNL3VMIG'':

''TOP_10_UNIQUE_TOKEN_DIFFS_A''
{{{
1: 167 | رياضي: 167 | 9: 44 | 8: 38 | 7: 28 | 6: 16 | 5: 9 | 4: 6 | 3: 2 | 9622243
}}}

''TOP_10_UNIQUE_TOKEN_DIFFS_B''
{{{
رياضي: 44 | 8رياضي: 38 | 7رياضي: 28 | 10رياضي: 24 | 6رياضي: 16 | 5رياضي: 9 | 4رياضي: 6 | 3رياضي: 2 | 96222431: 1
}}}

= Post-Study Reflection/Areas for Improvements =

@@ -317, +355 @@
limit 100;
}}}

[SQL6]
{{{
select ca.id, file_path,
1-(cast(ca.num_common_tokens as float) / cast(ca.num_alphabetic_tokens as float)) as OOV_A,
ca.num_alphabetic_tokens,
1-(cast(cb.num_common_tokens as float) / cast(cb.num_alphabetic_tokens as float)) as OOV_B,
cb.num_alphabetic_tokens,
ca.lang_id_1, ca.lang_id_prob_1,
cb.lang_id_1, cb.lang_id_prob_1,
ca.top_n_tokens, cb.top_n_tokens
from contents_b cb
join contents_a ca on cb.id=ca.id
join profiles_a pa on ca.id=pa.id
join containers c on pa.container_id=c.container_id
where cb.lang_id_1 = 'bn'
and ca.num_alphabetic_tokens > 0
and cb.num_alphabetic_tokens > 0
order by OOV_B asc
limit 100;
}}}

[SQL7]
{{{
select file_path, ca.top_n_tokens, cb.top_n_tokens,
(cb.num_common_tokens-ca.num_common_tokens) as delta_common_tokens,
top_10_unique_token_diffs_a, top_10_unique_token_diffs_b
from contents_a ca
join contents_b cb on ca.id=cb.id
join content_comparisons cc on cc.id=ca.id
join profiles_a pa on ca.id=pa.id
join containers c on pa.container_id=c.container_id
where ca.lang_id_1='ur'
and cb.lang_id_1='ur'
order by delta_common_tokens asc;
}}}

= How to make sense of the tika-eval reports =

Exceptions aside, the critical file is ''content/content_diffs_with_exceptions.xlsx''. It shows differences in the content that was extracted. Column ''TOP_10_UNIQUE_TOKEN_DIFFS_A'' records the 10 most frequent tokens that appear only in the "A" extracts (pdftotext); ''TOP_10_UNIQUE_TOKEN_DIFFS_B'' records the 10 most frequent tokens that appear only in the "B" extracts (Tika/PDFBox); ''NUM_COMMON_TOKENS_DIFF_IN_B'' records whether there was an increase (positive number) or a decrease in "common tokens" when moving from "A" to "B" as the extraction tool.
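The arithmetic behind these columns can be sketched as follows. This is a minimal, illustrative Python sketch -- not tika-eval's actual implementation -- whose function names are hypothetical; the OOV% formula mirrors the one computed in [SQL6], and the unique-token diffs mirror the ''TOP_10_UNIQUE_TOKEN_DIFFS_A''/''_B'' columns.

```python
from collections import Counter

def oov_percent(tokens, common_words):
    # OOV% as in [SQL6]: 1 - (num_common_tokens / num_alphabetic_tokens).
    # Returns None when there are no alphabetic tokens, rather than a
    # misleading 0% built on no evidence (see point 4 above).
    alphabetic = [t for t in tokens if t.isalpha()]
    if not alphabetic:
        return None
    num_common = sum(1 for t in alphabetic if t in common_words)
    return 1.0 - num_common / len(alphabetic)

def top_unique_token_diffs(counts_a, counts_b, n=10):
    # The n most frequent tokens appearing ONLY in one extract.
    only_a = Counter({t: c for t, c in counts_a.items() if t not in counts_b})
    only_b = Counter({t: c for t, c in counts_b.items() if t not in counts_a})
    return only_a.most_common(n), only_b.most_common(n)

# Toy example: extract "A" (pdftotext) vs. extract "B" (PDFBox/Tika)
counts_a = Counter(["the", "training", "training", "42"])
counts_b = Counter(["the", "training", "model", "42"])
print(oov_percent(list(counts_a.elements()), {"the", "training"}))  # 0.0
print(top_unique_token_diffs(counts_a, counts_b))  # ([], [('model', 1)])
```

Note that ''NUM_COMMON_TOKENS_DIFF_IN_B'' is then simply B's common-token count minus A's, which is why the duplicated-header problem in point 3.e above can inflate it.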