[
https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr updated TIKA-1419:
----------------------------------
Attachment: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx
Thank you [[email protected]], here's the result of some manual analysis.
The good news is that I found a few improvements and only two regressions, and
no cases of "smaller results" like those seen with 1.8.7. Here are some
suggestions for how the automatic analysis could be improved:
- use a dictionary, or maybe just count a few common English words with at least
three characters ( https://en.wikipedia.org/wiki/Most_common_words_in_English
), so that files that are mostly made of trash can be ignored (although the
trash changes)
- delete files from the test set that are known to be corrupt, or that won't
yield any useful text even in Adobe Reader, so that the manual investigation
doesn't have to be repeated each time.
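To make the first suggestion concrete, here is a rough sketch of the word-count
idea in Python. It is only an illustration, not part of the existing comparison
tooling: the abbreviated word list, the function names and the 5% threshold are
all assumptions that would need tuning against the real test set.

```python
import re

# Hypothetical, abbreviated list of common English words with at least three
# letters; a real check would load a fuller list.
COMMON_WORDS = frozenset({
    "the", "and", "that", "have", "for", "not", "with", "you", "this",
    "but", "his", "from", "they", "say", "her", "she", "will", "one",
    "all", "would", "there", "their", "what",
})

# Only alphabetic runs of three or more characters are counted as tokens.
WORD_RE = re.compile(r"[A-Za-z]{3,}")


def common_word_ratio(text):
    """Return the fraction of 3+ letter tokens that are common English words."""
    tokens = [t.lower() for t in WORD_RE.findall(text)]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in COMMON_WORDS)
    return hits / len(tokens)


def looks_like_trash(text, min_ratio=0.05):
    """Flag extracted text whose common-word ratio is below an assumed threshold."""
    return common_word_ratio(text) < min_ratio
```

Files whose extracted text is flagged by both versions could then be left out
of the comparison report, even if the trash itself differs between versions.
This only makes sense for documents that are expected to be in English.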
I analysed only cases where there were no exceptions. Within the next few days,
I'll investigate some of the cases where there are still exceptions; however,
most of these are corrupt files that even Adobe Reader doesn't display.
> Upgrade to PDFBox 1.8.7
> -----------------------
>
> Key: TIKA-1419
> URL: https://issues.apache.org/jira/browse/TIKA-1419
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Minor
> Attachments: compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.csv,
> compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.xlsx,
> pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOT.zip
>
>
> Will run against govdocs1 early next week and then upgrade if no major
> regressions are found.