[jira] [Updated] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich updated TIKA-1575:
    Fix Version/s: 1.8

Upgrade to PDFBox 1.8.9 when available
    Key: TIKA-1575
    URL: https://issues.apache.org/jira/browse/TIKA-1575
    Project: Tika
    Issue Type: Improvement
    Reporter: Tim Allison
    Priority: Minor
    Fix For: 1.8
    Attachments: 005937.pdf.json, 005937_1_8_9-SNAPSHOT.pdf.json, 10-814_Appendix B_v3.pdf, 524276_719128_diffs.zip, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip, PDFBox_1_8_8Vs1_8_9_20150316.zip, caught_ex_1_8_9.zip, content_diffs_20150316.xlsx, diffs_1_8_9_multithread_vs_single_thread.xlsx, reports_1_8_9_multithread_vs_single.zip

The PDFBox community is about to release 1.8.9. Let's use this issue to track discussions before the release and to track Tika's upgrade to PDFBox 1.8.9.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available
Tim Allison updated TIKA-1575:
    Attachment: diffs_1_8_9_multithread_vs_single_thread.xlsx

When I loosen the restriction to report all files that have any content diffs between 1.8.9 multithreaded vs 1.8.9 single threaded, there are 6 files with content diffs. I _think_ these can be explained by the static PDFont and clearing resources.

I post this only to share this information. This should not be viewed as a blocker on 1.8.9.
[jira] [Updated] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available
Tim Allison updated TIKA-1575:
    Attachment: reports_1_8_9_multithread_vs_single.zip

I ran 1.8.9 single threaded and compared the output with the multithreaded 1.8.9 run; same tika-app.jar, same OS.

If you look at the content diffs, 005937 and 524276 are flagged (again). But what's really weird is that lang id differs for 491 files. Lang id works on the full string, and my content diff code works on tokens identified by Lucene's StandardAnalyzer. So this suggests that there may be a fairly large-ish difference in the non-word characters that is causing language id to differ.

Fortunately, all else remains the same: number of attachments, number of metadata values, number of exceptions.
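The gap between a token-based content diff and a full-string language identifier, as described in the comment above, can be illustrated with a minimal sketch. It assumes a crude regex tokenizer as a stand-in for Lucene's StandardAnalyzer (which differs in detail), and the two sample extracts are made up for illustration; this is not the actual eval code.

```python
import re
from collections import Counter

def tokens(text):
    """Crude stand-in for an analyzer: lowercase runs of word characters."""
    return Counter(re.findall(r"\w+", text.lower()))

def token_diff(a, b):
    """Return tokens whose counts differ between the two extracts."""
    ta, tb = tokens(a), tokens(b)
    return {t: (ta[t], tb[t]) for t in ta.keys() | tb.keys() if ta[t] != tb[t]}

# Hypothetical extracts: same words, different non-word characters.
single_threaded = "Monitoring report: page 14."
multi_threaded = "Monitoring ... report -- page 14 **"

print(token_diff(single_threaded, multi_threaded))  # no token-level difference
print(single_threaded == multi_threaded)            # but the full strings differ
```

Anything that consumes the full string, such as a character n-gram language identifier, sees the second difference even though the token diff is empty.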
[jira] [Updated] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available
Tim Allison updated TIKA-1575:
    Attachment: 005937.pdf.json

Yes, I can't find it with search in Acrobat Reader either, but it was extracted by Tika's PDF parser/wrapper with PDFBox 1.8.8. It looks like it is in a link on p. 14, to the left of the page.
[jira] [Updated] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available
Tim Allison updated TIKA-1575:
    Attachment: 005937_1_8_9-SNAPSHOT.pdf.json

Corrupted characters where "monitoring" should be. Given that there are 250k files in the set, this may be below the noise.
[jira] [Updated] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available
Tim Allison updated TIKA-1575:
    Attachment: PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip
                PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx

[~tilman], thank you, again, for pinging me on the impending release of PDFBox 1.8.9. And, also thanks to you, I've turned on the AccessChecker, so you shouldn't see any content from files that don't allow extraction.

I ran the most recent eval code against all files that end in a pdf extension in govdocs1. I've included in the xlsx file all files with some kind of an exception or with any difference in attachment counts, metadata value counts, lang id or content. I've also included an example of a static dump of reports from the comparison database. More work remains on that...

I haven't had a chance to join in your earlier comments from our work on the 1.8.8 release. Many apologies!

My quick impression:
1) no differences in attachments
2) no differences in metadata values
3) 1.8.9 fixed 3 null pointer exceptions, no new exceptions
4) Content wise:
   a) with 1.8.9 we're getting less form field info (looks like internal field names? More digging is required...)
   b) might be actual modest regressions with 147/147012.pdf 223/223704.pdf
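The kind of per-file comparison described above (exceptions, attachment counts, metadata value counts, lang id, content) could be sketched roughly as follows. This is not the actual eval code: the `ExtractRecord` fields, the `diff_runs` helper, and the sample values are all hypothetical, and the sketch only reports differences between runs rather than every file that threw an exception.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass(frozen=True)
class ExtractRecord:
    """Hypothetical per-file summary of one extraction run."""
    exception: Optional[str]   # exception class name, or None if parsing succeeded
    num_attachments: int
    num_metadata_values: int
    lang_id: str
    content: str

def diff_runs(run_a, run_b):
    """Map each file present in both runs to the list of fields that differ."""
    report = {}
    for path in sorted(run_a.keys() & run_b.keys()):
        a, b = run_a[path], run_b[path]
        changed = [f.name for f in fields(ExtractRecord)
                   if getattr(a, f.name) != getattr(b, f.name)]
        if changed:
            report[path] = changed
    return report

# Made-up sample data, for illustration only.
old = {
    "a.pdf": ExtractRecord(None, 0, 12, "en", "hello world"),
    "b.pdf": ExtractRecord("NullPointerException", 0, 0, "", ""),
}
new = {
    "a.pdf": ExtractRecord(None, 0, 12, "en", "hello world"),
    "b.pdf": ExtractRecord(None, 1, 9, "en", "some text"),
}
print(diff_runs(old, new))
```

Files that are identical across both runs, like `a.pdf` here, drop out of the report, which keeps the output small on a large corpus.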
[jira] [Updated] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available
Tim Allison updated TIKA-1575:
    Attachment: 10-814_Appendix B_v3.pdf

Form clutter... This was embedded inside 776568. With PDFBox 1.8.8, we extracted the keys for the subform (but there was no meaningful content in this doc):

{noformat}
Briefings\n\nNo\n\n NWSI 10-814 November 10, 2008\n\n 19\n\n\n\tform1[0]: \n\t#subform[0]: \n\tPrintButton1[0]: \n\tCheckBox1[0]: \n\tCheckBox2[0]: \n\tTextField1[0]: \n\tCheckBox5[0]: \n\tCheckBox6[0]: \n\tTextField2[0]: \n\tTextField3[0]: \n\tCheckBox9[0]: \n\tCheckBox10[0]: \n\tCheckBox11[0]: \n\tCheckBox12[0]: \n\tCheckBox11[1]: \n\tCheckBox12[1]: \n\tCheckBox11[2]: \n\tCheckBox12[2]: \n\tTextField4[0]: \n\tTextField2[1]: \n\tTextField9[0]: \n\n\t#subform[1]: \n\tCheckBox1[1]: \n\tCheckBox2[1]: \n\tTextField1[1]: \n\tCheckBox5[1]: \n\tCheckBox6[1]: \n\tCheckBox9[1]: \n\tCheckBox10[1]: \n\tCheckBox11[3]: \n\tCheckBox12[3]: \n\tCheckBox11[4]: \n\tCheckBox12[4]: \n\tTextField4[1]: \n\tTextField5[0]: \n\tCheckBox5[2]: \n\tCheckBox6[2]: \n\n\t#subform[2]: \n\tCheckBox1[2]: \n\tCheckBox2[2]: \n\tCheckBox9[2]: \n\tCheckBox10[2]: \n\tTextField4[2]: \n\tCheckBox5[3]: \n\tCheckBox6[3]: \n\tCheckBox1[3]: \n\tCheckBox2[3]: \n\tCheckBox5[4]: \n\tCheckBox6[4]: \n\tCheckBox9[3]: \n\tCheckBox10[3]: \n\tTextField4[3]: \n\tCheckBox9[4]: \n\tCheckBox10[4]: \n\tTextField6[0]: \n\tTextField7[0]: \n\tCheckBox9[5]: \n\tCheckBox10[5]: \n\tTextField6[1]: \n\tTextField6[2]: \n\tTextField8[0]: \n\tTextField8[1]: \n\n\t#subform[3]: \n\tCheckBox1[4]: \n\tCheckBox2[4]: \n\tCheckBox5[5]: \n\tCheckBox6[5]: \n\tCheckBox9[6]: \n\tCheckBox10[6]: \n\tTextField4[4]: \n\tCheckBox5[6]: \n\tCheckBox6[6]: \n\tCheckBox1[5]: \n\tCheckBox2[5]: \n\tCheckBox5[7]: \n\tCheckBox6[7]: \n\tCheckBox5[8]: \n\tCheckBox5[9]: \n\tCheckBox6[8]: \n\tCheckBox6[9]: \n\tTextField8[2]: \n\tCheckBox9[7]: \n\tCheckBox10[7]: \n\tTextField6[3]: \n\tTextField6[4]: \n\tCheckBox5[10]: \n\tCheckBox5[11]: \n\tCheckBox6[10]: \n\tCheckBox6[11]: \n\n\n\n\n,
{noformat}

In 1.8.9, there's just this:

{noformat}
 Briefings\n\nNo\n\n NWSI 10-814 November 10, 2008\n\n 19\n\n\n\tform1[0]: \n\n\n\n
{noformat}
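To quantify the form-field difference between the two dumps above, one could pull out the field keys with a regular expression and compare the two lists. The following is a minimal sketch: the pattern is an assumption based on the `Name[index]: ` shape seen in this one file, `field_keys` is a hypothetical helper, and the two strings are abbreviated from the dumps above with the escapes turned into real tabs and newlines.

```python
import re

# Matches keys such as "form1[0]" or "#subform[2]" that follow a tab at line start.
FIELD_KEY = re.compile(r"^\t(#?\w+\[\d+\]): ", re.MULTILINE)

def field_keys(text):
    """Return form field keys in the order they appear in the extracted text."""
    return FIELD_KEY.findall(text)

# Abbreviated versions of the 1.8.8 and 1.8.9 extracts.
v188 = "19\n\n\n\tform1[0]: \n\t#subform[0]: \n\tPrintButton1[0]: \n\tCheckBox1[0]: \n"
v189 = "19\n\n\n\tform1[0]: \n\n\n\n"

kept = set(field_keys(v189))
missing = [k for k in field_keys(v188) if k not in kept]
print(missing)  # ['#subform[0]', 'PrintButton1[0]', 'CheckBox1[0]']
```

Run over the full extracts, a list like `missing` would show whether 1.8.9 drops only empty internal field names or also fields that carry values.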