For the reports comparing 2.0.3 with 2.0.5, see https://github.com/tballison/share/blob/master/tika_comparisons/reports_1_14V1_15.zip
That was a full run against all file types of Tika 1.14 vs 1.15-SNAPSHOT from April 25.

-----Original Message-----
From: Allison, Timothy B. [mailto:[email protected]]
Sent: Monday, May 8, 2017 8:43 PM
To: [email protected]
Subject: RE: 2.0.6 release ?

Content

1) To get a _general_ sense of overall content extraction, see "content/common_token_comparisons_by_mime.xlsx". This suggests that we've lost 248k "common words"[1], which, out of 2.6 billion, isn't much. However, we also lost 18 million common words going from 2.0.3 (Tika 1.14) to 2.0.5 (Tika 1.15-SNAPSHOT)... so I'd hope the fix to PDFBOX-3717 would have led to an improvement.

2) If you want to compare content whether or not there was a parse exception, see "content/content_diffs_with_exceptions.xlsx".

3) If you only want to see content diffs where neither extract had an exception, see "content/content_diffs_ignore_exceptions.xlsx".

To make quick sense of the content_diffs files, sort "NUM_COMMON_TOKENS_DIFF_IN_B" in ascending order, and you'll see which files lost the most common tokens. To see which files changed the most, sort on DICE_COEFFICIENT or OVERLAP, which compare the number of unique tokens/tokens in common... a low number means little similarity, while a number close to 1.0 means that the unigrams are nearly identical.

From a quick look, many of the files with fewer common words are in the "likely_broken" and/or "truncated" subdirectories... Some exceptions to this rule include the following, but there are more... and overall, there is a fair amount of loss from 2.0.3.

govdocs1/202/202097.pdf
govdocs1/358/358043.pdf
commoncrawl2/C5/C5FUETRXI26MXZDK4YP5YYQA2N6GHEC6
commoncrawl2/QR/QRGKM44N7J62Y6BZHTP2BC7BCHF3SJ56

[1] For this version of tika-eval, I expanded Tilman's initial recommendation of common words for English a bit. I took the top 20k most common words (4 characters or more, except for CJK) for a large number of Wikipedia dumps.
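As a quick illustration of what those two columns measure: the Dice coefficient and overlap coefficient are standard set-similarity measures over the unique tokens (unigrams) of each extract. The sketch below is a minimal Python rendition of the standard formulas, not tika-eval's actual (Java) implementation; the function names are my own.

```python
# Hypothetical sketch of the set-similarity metrics behind
# DICE_COEFFICIENT and OVERLAP; tika-eval itself is written in Java.

def dice_coefficient(tokens_a, tokens_b):
    """2 * |A ∩ B| / (|A| + |B|) over the sets of unique tokens."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # two empty extracts are trivially identical
    return 2 * len(a & b) / (len(a) + len(b))

def overlap_coefficient(tokens_a, tokens_b):
    """|A ∩ B| / min(|A|, |B|) over the sets of unique tokens."""
    a, b = set(tokens_a), set(tokens_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

same = "the quick brown fox".split()
print(dice_coefficient(same, same))        # 1.0: identical unigrams
print(dice_coefficient(same, ["cat"]))     # 0.0: no tokens in common
```

Both metrics land in [0, 1], which is why sorting on them surfaces the most-changed files: values near 0 mean the two extracts share almost no vocabulary, values near 1.0 mean the unigram sets are nearly identical.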
I removed common HTML markup words (body, form, table) so that failure to strip HTML doesn't incorrectly boost scores. We apply language id and then use the common words for that language. For example, for truncated_pdfs/commoncrawl2_likely_broken/IA/IA64I4PY77P4IVKTLZ3WHRCNSODW3PZW:

* PDFBox 2.0.5 extracted text that was id'd as "French", and there were 1580 tokens from the French list of common words.
* PDFBox 2.0.6-SNAPSHOT extracted text that was id'd as "English", and there were 320 common words from the English list of common words.

-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]]
Sent: Monday, May 8, 2017 10:01 AM
To: [email protected]
Subject: Re: 2.0.6 release ?

On 08.05.2017 at 15:06, Allison, Timothy B. wrote:
> Happy to. Will kick off now?

Yes

Tilman

> -----Original Message-----
> From: Tilman Hausherr [mailto:[email protected]]
> Sent: Saturday, May 6, 2017 10:02 AM
> To: [email protected]
> Subject: Re: 2.0.6 release ?
>
> On 04.05.2017 at 18:10, Andreas Lehmkuehler wrote:
>> On 02.05.2017 at 12:42, Andreas Lehmkühler wrote:
>>> Hi,
>>>
>>> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now,
>>> any objections?
>> I'm targeting the 15th or 16th
> Tim, could you please run your tests when time allows?
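The scoring idea described above, identify the language of the extracted text first and only then count tokens against that language's common-word list, can be sketched as follows. Everything here is a toy stand-in: the tiny word lists and the naive detector are illustrative assumptions, not tika-eval's real language models or 20k-word Wikipedia-derived lists.

```python
# Toy sketch of "language id, then count common words for that language".
# COMMON_WORDS and detect_language are hypothetical stand-ins for
# tika-eval's real word lists and language detector.
COMMON_WORDS = {
    "eng": {"that", "with", "from", "have", "this"},
    "fra": {"avec", "dans", "pour", "cette", "sont"},
}

def detect_language(tokens):
    # Naive detector: pick the language whose common-word list hits most.
    return max(COMMON_WORDS,
               key=lambda lang: sum(t in COMMON_WORDS[lang] for t in tokens))

def common_word_count(text):
    tokens = text.lower().split()
    lang = detect_language(tokens)
    return lang, sum(t in COMMON_WORDS[lang] for t in tokens)

lang, n = common_word_count("Avec cette page pour les fichiers")
print(lang, n)  # fra 3
```

This also shows why the IA... example above is telling: when a broken extract flips the detected language (French to English), the common-word count is taken against a different list, and a large drop (1580 to 320) signals that the extracted text itself degraded.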
>
> Thanks
>
> Tilman
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
