[
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371649#comment-14371649
]
Tim Allison commented on TIKA-1575:
-----------------------------------
Thank you, Maruan. I'd be thrilled to have you review our code, as long as you
don't laugh too hard. AcroForm processing starts at line 557
[here|http://svn.apache.org/viewvc/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java?revision=1663424&view=markup].
Please let us know if we're doing anything crazy...
Yes. I'm not sure what the right answer is. Given that we dump the AcroForm
content at the end of the document and don't try to extract it in the proper
order with the regular content, I'm not sure it makes sense to repeat the
footer four times.
> Upgrade to PDFBox 1.8.9 when available
> --------------------------------------
>
> Key: TIKA-1575
> URL: https://issues.apache.org/jira/browse/TIKA-1575
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Attachments: 005937.pdf.json, 005937_1_8_9-SNAPSHOT.pdf.json,
> 10-814_Appendix B_v3.pdf, 524276_719128_diffs.zip,
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx,
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip,
> PDFBox_1_8_8Vs1_8_9_20150316.zip, content_diffs_20150316.xlsx,
> reports_1_8_9_multithread_vs_single.zip
>
>
> The PDFBox community is about to release 1.8.9. Let's use this issue to
> track discussions before the release and to track Tika's upgrade to PDFBox
> 1.8.9.