[
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232029#comment-14232029
]
Tim Allison edited comment on TIKA-1442 at 12/2/14 7:58 PM:
------------------------------------------------------------
[~tilman], mea culpa. That botch was typical of the rest of my day yesterday.
I reran with fresh builds (b162) of 1.8.8-SNAPSHOT and added three extra columns
to help highlight content differences. Take the entry for 005/005260.pdf as an
example:
*TOP_10_UNIQUE_TOKEN_DIFFS_PDFBox_1_8_6*
contains the top 10 most frequent tokens that appear in the text extracted via
1.8.6 but not in the text extracted via 1.8.8.
{noformat}
originat: 2 | can't: 1 | don't: 1 | editor's: 1 |
leaving: 1 | retroactively: 1 | site's: 1 | stovepiped: 1 | tic's: 1
{noformat}
*TOP_10_UNIQUE_TOKEN_DIFFS_PDFBox_1_8_8-b162-CLASSIC*
contains the top 10 most frequent tokens that appear in the text extracted via
1.8.8 but not in the text extracted via 1.8.6.
{noformat}
insideros: 8 | ohelpo: 4 | os: 4 | ooriginatingo: 3 |
osearch: 3 | ooriginat: 2 | opaint: 2 | owholly: 2 |
results.o: 2 | searcho: 2
{noformat}
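For anyone who wants to reproduce the two unique-token columns by hand, here is a minimal Python sketch. It is not the eval code itself: the function names, the whitespace tokenization of the toy input, and the alphabetical tie-break are all assumptions made for illustration.

```python
from collections import Counter


def top_unique_tokens(tokens_a, tokens_b, n=10):
    """Return the n most frequent tokens in tokens_a that never occur in tokens_b."""
    counts_a = Counter(tokens_a)
    seen_in_b = set(tokens_b)
    unique = {tok: cnt for tok, cnt in counts_a.items() if tok not in seen_in_b}
    # Assumption: order by descending count, then alphabetically to break ties.
    ranked = sorted(unique.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


def format_row(pairs):
    """Render (token, count) pairs in the 'token: count | token: count' style."""
    return " | ".join(f"{tok}: {cnt}" for tok, cnt in pairs)


# Toy token streams (hypothetical, not taken from 005/005260.pdf).
old_tokens = "insider's search help insider's search".split()
new_tokens = "insideros osearch ohelpo insideros search".split()

print(format_row(top_unique_tokens(old_tokens, new_tokens)))
print(format_row(top_unique_tokens(new_tokens, old_tokens)))
```

Calling the same function with the arguments swapped gives the column for the other version, so the two columns are mirror images of one computation.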
*TOP_10_TOKEN_DIFFS*
captures the increase or decrease as we move from 1.8.6 to 1.8.8. There are 10
more "o", 8 fewer "insider's", 8 more "insideros", etc.
{noformat}
o: 10 | insider's: -8 | insideros: 8 | search: -5 | help: -4 |
ohelpo: 4 | os: 4 | s: -4 | ooriginatingo: 3 | originating: -3
{noformat}
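The signed differences above can be sketched the same way. Again, this is an illustration under assumptions rather than the eval code: the function name is made up, and ranking by absolute change with an alphabetical tie-break is my guess at a reasonable ordering.

```python
from collections import Counter


def top_token_diffs(tokens_old, tokens_new, n=10):
    """Return the n tokens whose counts changed most going from old to new.

    Positive values mean the token is more frequent in the new extraction,
    negative values mean it is less frequent.
    """
    old_counts = Counter(tokens_old)
    new_counts = Counter(tokens_new)
    deltas = {}
    for tok in old_counts.keys() | new_counts.keys():
        delta = new_counts[tok] - old_counts[tok]  # a missing token counts as 0
        if delta != 0:
            deltas[tok] = delta
    # Assumption: largest absolute change first, alphabetical order on ties.
    ranked = sorted(deltas.items(), key=lambda kv: (-abs(kv[1]), kv[0]))
    return ranked[:n]


# Toy input (hypothetical counts chosen to mimic the pattern described above).
old_tokens = ["insider's"] * 8 + ["search"] * 5 + ["the"] * 3
new_tokens = ["insideros"] * 8 + ["o"] * 10 + ["the"] * 3

for tok, delta in top_token_diffs(old_tokens, new_tokens):
    print(f"{tok}: {delta}")
```

Tokens whose counts are identical in both extractions (like "the" in the toy input) drop out, so only real changes compete for the ten slots.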
The eval modifications are hot off the press, and there may be surprises.
As you found, there may be surprises in getting the correct versions of PDFBox,
too. :(
*N.B.*
The diff between "th" and "thy" is explained by Unicode normalization on
þÿ vs þ�. See, for example, 955226, where this occurs at the beginning of the
document before "CONGRESSIONAL OFFICE BUDGET COST ESTIMATE".
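To show how folding can turn þÿ into "thy", here is a small Python sketch. It is only an assumption about the kind of normalization involved: NFKD decomposition handles ÿ, but þ has no Unicode decomposition, so the `EXTRA_FOLDS` table below is a hypothetical stand-in for whatever ASCII-folding step the tokenizer applies.

```python
import unicodedata

# Hypothetical fold table for letters that have no Unicode decomposition.
EXTRA_FOLDS = {"þ": "th", "Þ": "TH"}


def fold_token(text):
    """Strip combining marks after NFKD and apply the extra fold table."""
    out = []
    for ch in unicodedata.normalize("NFKD", text):
        if unicodedata.combining(ch):
            continue  # drop the diaeresis that NFKD splits off from ÿ
        out.append(EXTRA_FOLDS.get(ch, ch))
    return "".join(out)


print(fold_token("þÿ"))
print(fold_token("þ"))
```

With this folding, a þ followed by ÿ becomes a three-letter token, while a þ on its own (or followed by something that does not fold to a letter) becomes a two-letter one.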
Cheers!
> Upgrade to PDFBox 1.8.8
> -----------------------
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Assignee: Tim Allison
> Fix For: 1.8
>
> Attachments: PDFBox_1_8_6DVPDFBox_1_8_8-TRAD-b156.xlsx,
> PDFBox_1_8_6VPDFBox_1_8_8-CLASSIC-b162.xlsx,
> PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, PDFBox_1_8_6VPDFBox_1_8_8-b145.zip,
> PDFBox_1_8_8-CLASSICVPDFBox_1_8_8-NONSEQ-b162.xlsx,
> PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx,
> PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx,
> PDFBox_1_8_8-TRADVPDFBox_1_8_8-NONSEQ-b156.xlsx,
> pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx,
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to
> 1.8.8 as soon as it is ready. I'm tempted to call this a blocker on Tika
> 1.7. Let's use this issue to carry on the discussion of regression testing
> (if any further discussion is necessary) or any other prep that needs to
> happen before 1.8.8's release.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)