[
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368449#comment-14368449
]
Tim Allison edited comment on TIKA-1575 at 3/19/15 3:35 AM:
------------------------------------------------------------
From manual review...overall, I'm not sure there is anything glaring,
especially given that we're testing against ~250k documents.
Based on the More_in_A column, it looks like there are two docs with much more
language content in 1.8.8 vs 1.8.9.
* 005937.pdf is an anomaly that can't be reproduced in a single-threaded
environment. Multithreading "bug" improves extraction?! :)
* 524276.pdf: it looks like much of the first page is duplicated with 1.8.8,
whereas 1.8.9 gets junk for the first copy but keeps the content that was
duplicated in 1.8.8.
It looks like there are quite a few documents where "this page is intentionally
left blank" is captured more often in the 1.8.8 output than in the 1.8.9
output. For example, in 473194.pdf, there's the main content, and then the
Bookmarks are dumped at the end of the document. In 1.8.8, "this page is
intentionally left blank" correctly appears three times, but it appears only
once in 1.8.9. Not a big loss of information in my opinion, unless it points
to a potential underlying problem...I don't know.
A similar thing happens with 719128.pdf, where the footer is repeated 3 times
with 1.8.8 but is only extracted once with 1.8.9; the correct number should be
4.
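For anyone who wants to spot-check repeated phrases like these across the two
outputs, here is a minimal sketch in Python. It is not part of the comparison
tooling attached to this issue; the function names and the sample strings are
made up for illustration, and it assumes the extracted text for each version is
already available as a plain string.

```python
import re


def count_phrase(text: str, phrase: str) -> int:
    """Count occurrences of phrase in text, ignoring case and
    tolerating any whitespace (including line breaks) between words."""
    words = [re.escape(w) for w in phrase.split()]
    pattern = re.compile(r"\s+".join(words), re.IGNORECASE)
    return len(pattern.findall(text))


def compare_counts(text_a: str, text_b: str, phrase: str) -> dict:
    """Report how often phrase appears in each extraction output."""
    a = count_phrase(text_a, phrase)
    b = count_phrase(text_b, phrase)
    return {"a": a, "b": b, "more_in_a": a - b}


# Hypothetical stand-ins for the text extracted by the two versions.
PHRASE = "this page is intentionally left blank"
old_text = "Main content\nThis page is intentionally\nleft blank\n" * 3
new_text = "Main content\n" * 3 + "This page is intentionally left blank\n"

result = compare_counts(old_text, new_text, PHRASE)
print(result)
```

Matching on whitespace-separated words rather than the literal string is a
deliberate choice here, since line wrapping can differ between extractions.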
There appear to be some differences in AcroForm language -- "Yes, No". In the
one I checked, 496816.pdf, the extraction appears to be more accurate in 1.8.9
than in 1.8.8: the sequence
{noformat} "Primary: Yes\n\tline: Yes\n\n\tPiggyback:" {noformat}
has only one "Yes" in 1.8.9.
[~tilman], what are you finding?
> Upgrade to PDFBox 1.8.9 when available
> --------------------------------------
>
> Key: TIKA-1575
> URL: https://issues.apache.org/jira/browse/TIKA-1575
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Attachments: 005937.pdf.json, 005937_1_8_9-SNAPSHOT.pdf.json,
> 10-814_Appendix B_v3.pdf, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx,
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip,
> PDFBox_1_8_8Vs1_8_9_20150316.zip, content_diffs_20150316.xlsx
>
>
> The PDFBox community is about to release 1.8.9. Let's use this issue to
> track discussions before the release and to track Tika's upgrade to PDFBox
> 1.8.9