For the reports comparing 2.0.3 with 2.0.5, see https://github.com/tballison/share/blob/master/tika_comparisons/reports_1_14V1_15.zip
That was a full run against all file types of Tika 1.14 vs 1.15-SNAPSHOT from April 25.

-----Original Message-----
From: Allison, Timothy B. [mailto:[email protected]]
Sent: Monday, May 8, 2017 8:43 PM
To: [email protected]
Subject: RE: 2.0.6 release ?

Content

1) To get a _general_ sense of overall content extraction, see "content/common_token_comparisons_by_mime.xlsx". This suggests that we've lost 248k "common words"[1], which, out of 2.6 billion, isn't much. However, we also lost 18 million common words going from 2.0.3 (Tika 1.14) to 2.0.5 (Tika 1.15-SNAPSHOT)... so I'd hope the fix to PDFBOX-3717 would have led to an improvement.

2) If you want to compare content whether or not there was a parse exception, see "content/content_diffs_with_exceptions.xlsx".

3) If you only want to see content diffs where neither extract had an exception, see "content/content_diffs_ignore_exceptions.xlsx".

To make quick sense of the content_diffs files, sort "NUM_COMMON_TOKENS_DIFF_IN_B" in ascending order, and you'll see which files lost the most common tokens. To see which files changed the most, sort on DICE_COEFFICIENT or OVERLAP, which compare the number of unique tokens/tokens in common... a low number means little similarity, while a number close to 1.0 means that the unigrams are nearly identical.

From a quick look, many of the files with fewer common words are in the "likely_broken" and/or "truncated" subdirectories... Some exceptions to this rule include the following, but there are more... and overall, there is a fair amount of loss from 2.0.3.

govdocs1/202/202097.pdf
govdocs1/358/358043.pdf
commoncrawl2/C5/C5FUETRXI26MXZDK4YP5YYQA2N6GHEC6
commoncrawl2/QR/QRGKM44N7J62Y6BZHTP2BC7BCHF3SJ56

[1] For this version of tika-eval, I expanded Tilman's initial recommendation of common words for English a bit. I took the top 20k most common words (4 characters or more, except for CJK) for a large number of Wikipedia dumps.
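As a quick illustration of what those two columns measure: the Dice coefficient and overlap coefficient are standard set-similarity measures over the unique tokens (unigrams) of each extract. The sketch below is a minimal Python rendition of the standard formulas, not tika-eval's actual (Java) implementation; the function names are my own.

```python
# Hypothetical sketch of the set-similarity metrics behind
# DICE_COEFFICIENT and OVERLAP; tika-eval itself is written in Java.

def dice_coefficient(tokens_a, tokens_b):
    """2 * |A ∩ B| / (|A| + |B|) over the sets of unique tokens."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # two empty extracts are trivially identical
    return 2 * len(a & b) / (len(a) + len(b))

def overlap_coefficient(tokens_a, tokens_b):
    """|A ∩ B| / min(|A|, |B|) over the sets of unique tokens."""
    a, b = set(tokens_a), set(tokens_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

same = "the quick brown fox".split()
print(dice_coefficient(same, same))        # 1.0: identical unigrams
print(dice_coefficient(same, ["cat"]))     # 0.0: no tokens in common
```

Both metrics land in [0, 1], which is why sorting on them surfaces the most-changed files: values near 0 mean the two extracts share almost no vocabulary, values near 1.0 mean the unigram sets are nearly identical.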
I removed common HTML markup words (body, form, table) so that failure to strip HTML doesn't incorrectly boost scores. We apply language id and then use the common words for that language. For example, for truncated_pdfs/commoncrawl2_likely_broken/IA/IA64I4PY77P4IVKTLZ3WHRCNSODW3PZW:

* PDFBox 2.0.5 extracted text that was id'd as "French", and there were 1580 tokens from the French list of common words.
* PDFBox 2.0.6-SNAPSHOT extracted text that was id'd as "English", and there were 320 common words from the English list of common words.

-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]]
Sent: Monday, May 8, 2017 10:01 AM
To: [email protected]
Subject: Re: 2.0.6 release ?

On 08.05.2017 at 15:06, Allison, Timothy B. wrote:
> Happy to. Will kick off now?

Yes

Tilman

> -----Original Message-----
> From: Tilman Hausherr [mailto:[email protected]]
> Sent: Saturday, May 6, 2017 10:02 AM
> To: [email protected]
> Subject: Re: 2.0.6 release ?
>
> On 04.05.2017 at 18:10, Andreas Lehmkuehler wrote:
>> On 02.05.2017 at 12:42, Andreas Lehmkühler wrote:
>>> Hi,
>>>
>>> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now,
>>> any objections?
>> I'm targeting the 15th or 16th
> Tim, could you please run your tests when time allows?
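The scoring idea described above, identify the language of the extracted text first and only then count tokens against that language's common-word list, can be sketched as follows. Everything here is a toy stand-in: the tiny word lists and the naive detector are illustrative assumptions, not tika-eval's real language models or 20k-word Wikipedia-derived lists.

```python
# Toy sketch of "language id, then count common words for that language".
# COMMON_WORDS and detect_language are hypothetical stand-ins for
# tika-eval's real word lists and language detector.
COMMON_WORDS = {
    "eng": {"that", "with", "from", "have", "this"},
    "fra": {"avec", "dans", "pour", "cette", "sont"},
}

def detect_language(tokens):
    # Naive detector: pick the language whose common-word list hits most.
    return max(COMMON_WORDS,
               key=lambda lang: sum(t in COMMON_WORDS[lang] for t in tokens))

def common_word_count(text):
    tokens = text.lower().split()
    lang = detect_language(tokens)
    return lang, sum(t in COMMON_WORDS[lang] for t in tokens)

lang, n = common_word_count("Avec cette page pour les fichiers")
print(lang, n)  # fra 3
```

This also shows why the IA... example above is telling: when a broken extract flips the detected language (French to English), the common-word count is taken against a different list, and a large drop (1580 to 320) signals that the extracted text itself degraded.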
>
> Thanks
>
> Tilman
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
