[jira] [Updated] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-29 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1575:
--
Fix Version/s: 1.8

 Upgrade to PDFBox 1.8.9 when available
 --

 Key: TIKA-1575
 URL: https://issues.apache.org/jira/browse/TIKA-1575
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
 Fix For: 1.8

 Attachments: 005937.pdf.json, 005937_1_8_9-SNAPSHOT.pdf.json, 
 10-814_Appendix B_v3.pdf, 524276_719128_diffs.zip, 
 PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, 
 PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip, 
 PDFBox_1_8_8Vs1_8_9_20150316.zip, caught_ex_1_8_9.zip, 
 content_diffs_20150316.xlsx, diffs_1_8_9_multithread_vs_single_thread.xlsx, 
 reports_1_8_9_multithread_vs_single.zip


 The PDFBox community is about to release 1.8.9.  Let's use this issue to 
 track discussions before the release and to track Tika's upgrade to PDFBox 
 1.8.9.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-24 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1575:
--
Attachment: diffs_1_8_9_multithread_vs_single_thread.xlsx

When I loosen the restriction and report every file that has any content diff 
between the multithreaded and single-threaded 1.8.9 runs, 6 files have 
content diffs.

I _think_ these can be explained by the static PDFont and the clearing of 
resources.  I post this only to share the information; it should not be viewed 
as a blocker for 1.8.9.



[jira] [Updated] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-20 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1575:
--
Attachment: reports_1_8_9_multithread_vs_single.zip

I ran 1.8.9 single-threaded and compared the output with the multithreaded 
1.8.9 run; same tika-app.jar, same OS.

If you look at the content diffs, 005937 and 524276 are flagged (again).  

But what's really weird is that lang id differs for 491 files.  Lang id works 
on the full string, while my content diff code works on tokens identified by 
Lucene's StandardAnalyzer.  This suggests that the non-word characters may 
differ enough between the two runs to change the language id result, even 
where the word tokens match.
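To illustrate how a token-based diff can miss what a full-string language identifier sees, here is a minimal sketch in Python. It is not the eval code: the `\w+` regex is only a rough stand-in for Lucene's StandardAnalyzer, and the two sample strings are made up.

```python
import re
from collections import Counter

def word_tokens(text):
    # Rough stand-in for StandardAnalyzer (assumption): lowercased word runs only.
    return Counter(re.findall(r"\w+", text.lower()))

def non_word_chars(text):
    # Everything a word-token diff ignores: punctuation, symbols, whitespace.
    return Counter(re.sub(r"\w", "", text))

# Made-up extracts of the same page from two runs.
single = "Monitoring report: 2008, page 19."
multi = "Monitoring report\ufffd\ufffd 2008\ufffd page 19\ufffd"

same_tokens = word_tokens(single) == word_tokens(multi)
same_non_word = non_word_chars(single) == non_word_chars(multi)
print(same_tokens, same_non_word)
```

Here the token comparison reports no difference while the non-word character counts differ, which is the kind of gap that could leave a content diff clean and still shift a language id computed over the full string.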

Fortunately, all else remains the same: number of attachments, number of 
metadata values, number of exceptions.


[jira] [Updated] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-17 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1575:
--
Attachment: 005937.pdf.json

Yes, I can't find it with search in Acrobat Reader either, but it was extracted 
by Tika's PDF parser/wrapper with PDFBox 1.8.8.  It looks like it is in a link 
on p. 14, on the left side of the page.


[jira] [Updated] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-17 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1575:
--
Attachment: 005937_1_8_9-SNAPSHOT.pdf.json

The 1.8.9-SNAPSHOT output has corrupted characters where the word "monitoring" 
should be.  Given that there are 250k files in the set, this may be below the 
noise.


[jira] [Updated] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-13 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1575:
--
Attachment: PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip
PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx

[~tilman], thank you, again, for pinging me on the impending release of PDFBox 
1.8.9.  And, also thanks to you, I've turned on the AccessChecker, so you 
shouldn't see any content from files that don't allow extraction.

I ran the most recent eval code against all files that end in a pdf extension 
in govdocs1.

I've included in the xlsx file all files with some kind of an exception or with 
any difference in attachment counts, metadata value counts, lang id or content.
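The inclusion rule described above can be sketched roughly as follows. This is a hypothetical illustration in Python, not the actual eval code: the `RunResult` record, its field names, and the sample values are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RunResult:
    # Hypothetical per-file summary of one extraction run; names are illustrative.
    exception: Optional[str]
    attachment_count: int
    metadata_value_count: int
    lang_id: str
    tokens: tuple

def differences(a, b):
    """Return the criteria on which two runs of the same file differ."""
    found = []
    if a.exception or b.exception:
        found.append("exception")
    if a.attachment_count != b.attachment_count:
        found.append("attachments")
    if a.metadata_value_count != b.metadata_value_count:
        found.append("metadata")
    if a.lang_id != b.lang_id:
        found.append("lang_id")
    if sorted(a.tokens) != sorted(b.tokens):
        found.append("content")
    return found

# Made-up results for one file under the two PDFBox versions.
old = RunResult(None, 0, 12, "en", ("nwsi", "10", "814", "form1", "checkbox1"))
new = RunResult(None, 0, 12, "en", ("nwsi", "10", "814", "form1"))
print(differences(old, new))
```

Under this sketch, a file would be listed in the spreadsheet whenever `differences` returns a non-empty list.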

I've also included an example of a static dump of reports from the comparison 
database.  More work remains on that...

I haven't had a chance to join in your earlier comments from our work on the 
1.8.8 release.  Many apologies!

My quick impression:
1) no differences in attachments
2) no differences in metadata values
3) 1.8.9 fixed 3 null pointer exceptions, no new exceptions
4) Content-wise:
  a) with 1.8.9 we're getting less form field info (looks like internal 
     field names? More digging is required...)
  b) there might be actual modest regressions with:
     147/147012.pdf
     223/223704.pdf



[jira] [Updated] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-13 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1575:
--
Attachment: 10-814_Appendix B_v3.pdf

Form clutter... This file was embedded inside 776568.

With PDFBox 1.8.8, we extracted the keys for the subform (but there was no 
meaningful content in this doc):
{noformat}Briefings\n\nNo\n\n   NWSI 10-814  November 10, 2008\n\n
19\n\n\n\tform1[0]: \n\t#subform[0]: \n\tPrintButton1[0]: \n\tCheckBox1[0]: 
\n\tCheckBox2[0]: \n\tTextField1[0]: \n\tCheckBox5[0]: \n\tCheckBox6[0]: 
\n\tTextField2[0]: \n\tTextField3[0]: \n\tCheckBox9[0]: \n\tCheckBox10[0]: 
\n\tCheckBox11[0]: \n\tCheckBox12[0]: \n\tCheckBox11[1]: \n\tCheckBox12[1]: 
\n\tCheckBox11[2]: \n\tCheckBox12[2]: \n\tTextField4[0]: \n\tTextField2[1]: 
\n\tTextField9[0]: \n\n\t#subform[1]: \n\tCheckBox1[1]: \n\tCheckBox2[1]: 
\n\tTextField1[1]: \n\tCheckBox5[1]: \n\tCheckBox6[1]: \n\tCheckBox9[1]: 
\n\tCheckBox10[1]: \n\tCheckBox11[3]: \n\tCheckBox12[3]: \n\tCheckBox11[4]: 
\n\tCheckBox12[4]: \n\tTextField4[1]: \n\tTextField5[0]: \n\tCheckBox5[2]: 
\n\tCheckBox6[2]: \n\n\t#subform[2]: \n\tCheckBox1[2]: \n\tCheckBox2[2]: 
\n\tCheckBox9[2]: \n\tCheckBox10[2]: \n\tTextField4[2]: \n\tCheckBox5[3]: 
\n\tCheckBox6[3]: \n\tCheckBox1[3]: \n\tCheckBox2[3]: \n\tCheckBox5[4]: 
\n\tCheckBox6[4]: \n\tCheckBox9[3]: \n\tCheckBox10[3]: \n\tTextField4[3]: 
\n\tCheckBox9[4]: \n\tCheckBox10[4]: \n\tTextField6[0]: \n\tTextField7[0]: 
\n\tCheckBox9[5]: \n\tCheckBox10[5]: \n\tTextField6[1]: \n\tTextField6[2]: 
\n\tTextField8[0]: \n\tTextField8[1]: \n\n\t#subform[3]: \n\tCheckBox1[4]: 
\n\tCheckBox2[4]: \n\tCheckBox5[5]: \n\tCheckBox6[5]: \n\tCheckBox9[6]: 
\n\tCheckBox10[6]: \n\tTextField4[4]: \n\tCheckBox5[6]: \n\tCheckBox6[6]: 
\n\tCheckBox1[5]: \n\tCheckBox2[5]: \n\tCheckBox5[7]: \n\tCheckBox6[7]: 
\n\tCheckBox5[8]: \n\tCheckBox5[9]: \n\tCheckBox6[8]: \n\tCheckBox6[9]: 
\n\tTextField8[2]: \n\tCheckBox9[7]: \n\tCheckBox10[7]: \n\tTextField6[3]: 
\n\tTextField6[4]: \n\tCheckBox5[10]: \n\tCheckBox5[11]: \n\tCheckBox6[10]: 
\n\tCheckBox6[11]: \n\n\n\n\n,{noformat}

In 1.8.9, there's just this:
{noformat}
Briefings\n\nNo\n\n   NWSI 10-814  November 10, 2008\n\n 19\n\n\n\tform1[0]: 
\n\n\n\n
{noformat}
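One way to list the form field keys that dropped out between the two versions is sketched below in Python. It is only an illustration: the regex is an assumption about the key format seen in the dumps above (tab, name, bracketed index, colon), and the two strings are abbreviated samples of those dumps with the escapes decoded.

```python
import re

# Matches keys such as "\tCheckBox1[0]: " or "\t#subform[2]: " (assumed format).
FIELD_KEY = re.compile(r"\t(#?\w+\[\d+\]): ")

def field_keys(extracted):
    # Return the form field keys in the order they appear in the extracted text.
    return FIELD_KEY.findall(extracted)

# Abbreviated samples of the 1.8.8 and 1.8.9 output shown above.
old = "19\n\n\n\tform1[0]: \n\t#subform[0]: \n\tPrintButton1[0]: \n\tCheckBox1[0]: \n"
new = "19\n\n\n\tform1[0]: \n\n\n\n"

kept = set(field_keys(new))
missing = [key for key in field_keys(old) if key not in kept]
print(missing)
```

For these samples, only the top-level `form1[0]` key survives in the 1.8.9 output; the subform and its child keys end up in `missing`.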

