On 09.05.2017 at 19:52, Tilman Hausherr wrote:
On 09.05.2017 at 02:43, Allison, Timothy B. wrote:
Content
1) To get a _general_ sense of overall content extract, see
"content/common_token_comparisons_by_mime.xlsx". This suggests that we've
lost 248k "common words"[1], which out of 2.6 billion isn't much. However, we
also lost 18 million common words going from 2.0.3 (Tika 1.14) to 2.0.5 (Tika
1.15-SNAPSHOT)...so I'd hope the fix to PDFBOX-3717 would have led to an
improvement.
2) If you want to compare content whether or not there was a parse
exception, see "content/content_diffs_with_exceptions.xlsx".
3) If you only want to see content diffs where both extracts did not have an
exception, see "content/content_diffs_ignore_exceptions.xlsx".
To make quick sense of the content_diffs_files, sort
"NUM_COMMON_TOKENS_DIFF_IN_B" in ascending order, and you'll see which files
lost the most common tokens.
To see which files changed the most, sort on DICE_COEFFICIENT or OVERLAP,
which compare the number of unique tokens/tokens in common...a low number
means little similarity, while a number close to 1.0 means that the unigrams
are nearly identical.
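
For readers who want to see what such similarity scores look like, here is a
minimal Python sketch. It assumes the Dice coefficient is computed over unique
tokens and the overlap over token counts; these are the textbook definitions
and may not match tika-eval's exact formulas. The function names are
hypothetical, not tika-eval API.

```python
from collections import Counter


def dice_coefficient(tokens_a, tokens_b):
    """Dice similarity over the sets of unique tokens (assumed definition)."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        # Arbitrary choice for this sketch: two empty extracts count as identical.
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def overlap(tokens_a, tokens_b):
    """Shared token occurrences relative to all occurrences (assumed definition)."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total = sum(counts_a.values()) + sum(counts_b.values())
    if total == 0:
        # Same arbitrary choice as above for two empty extracts.
        return 1.0
    # Counter & Counter keeps the minimum count of each shared token.
    shared = sum((counts_a & counts_b).values())
    return 2 * shared / total
```

Both functions return a value between 0.0 (no tokens shared) and 1.0 (same
unigrams), which matches the reading of the two columns described above.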
From a quick look, many of the files with fewer common words are in the
"likely_broken" and/or "truncated" subdirectories... Some exceptions to this
rule include the following, but there are more...and overall, there is a fair
amount of loss from 2.0.3.
govdocs1/202/202097.pdf
govdocs1/358/358043.pdf
commoncrawl2/C5/C5FUETRXI26MXZDK4YP5YYQA2N6GHEC6
commoncrawl2/QR/QRGKM44N7J62Y6BZHTP2BC7BCHF3SJ56
Thanks for the test... three of these four have been fixed; this was yet
another case of trouble recognizing the end of inline images. All were created
by "Leadtools". The fourth (202097.pdf) is in issue PDFBOX-3785.
Most issues are probably related to truncated files. Some of these do not even
display with Adobe Reader.
I've fixed all remaining regression tickets (in the end it was exactly 1).
@Tim Thanks for running the comparison
@Tilman Thanks for analyzing
Andreas
Tilman
[1] For this version of tika-eval, I expanded Tilman's initial recommendation
of common words for English a bit. I took the top 20k most common words (4
characters or more, except for CJK) for a large number of Wikipedia dumps. I
removed common html markup words (body, form, table) so that failure to strip
html doesn't incorrectly boost scores.
We apply language id and then use the common words for that language. For
example, for
truncated_pdfs/commoncrawl2_likely_broken/IA/IA64I4PY77P4IVKTLZ3WHRCNSODW3PZW
* PDFBox 2.0.5 extracted text that was id'd as "French", and there were 1580
tokens from the French list of common words.
* PDFBox 2.0.6-SNAPSHOT extracted text that was id'd as "English", and there
were 320 common words from the English list of common words.
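
As a rough illustration of the footnote, the following Python sketch builds a
common-word list from word frequencies and counts how many tokens of an
extract appear in the list for the detected language. It is not the tika-eval
code: the function names, the order of filtering, and the CJK check (a single
Unicode block) are assumptions, and the language id step is taken as an input
rather than implemented.

```python
from collections import Counter

# The markup words named in the footnote; the real exclusion list may be longer.
HTML_MARKUP_WORDS = {"body", "form", "table"}


def is_cjk(word):
    # Rough assumption: any character in the CJK Unified Ideographs block.
    return any("\u4e00" <= ch <= "\u9fff" for ch in word)


def build_common_words(word_counts, top_n=20000, min_len=4):
    """Keep the top_n most frequent words after filtering (assumed order)."""
    kept = Counter({
        word: count
        for word, count in word_counts.items()
        if word not in HTML_MARKUP_WORDS
        and (len(word) >= min_len or is_cjk(word))
    })
    return {word for word, _ in kept.most_common(top_n)}


def count_common_tokens(tokens, lang, common_words_by_lang):
    """Count tokens found in the common-word list of the detected language."""
    common = common_words_by_lang.get(lang, set())
    return sum(1 for token in tokens if token.lower() in common)
```

Because each extract is scored against the list for its own detected language,
a change in the detected language (French vs. English in the example above)
changes which list is used, so the two counts are not directly comparable.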
-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]]
Sent: Monday, May 8, 2017 10:01 AM
To: [email protected]
Subject: Re: 2.0.6 release ?
On 08.05.2017 at 15:06, Allison, Timothy B. wrote:
Happy to. Will kick off now?
Yes
Tilman
-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]]
Sent: Saturday, May 6, 2017 10:02 AM
To: [email protected]
Subject: Re: 2.0.6 release ?
On 04.05.2017 at 18:10, Andreas Lehmkuehler wrote:
On 02.05.2017 at 12:42, Andreas Lehmkühler wrote:
Hi,
I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now,
any objections?
I'm targeting the 15th or 16th
Tim, could you please run your tests when time allows?
Thanks
Tilman
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected] For
additional commands, e-mail: [email protected]