[ 
https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1419:
----------------------------------
    Attachment: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx

Thank you [[email protected]], here are the results of some manual analysis. 
The good news is that I found a few improvements, only two regressions, and 
no cases of "smaller results" like with 1.8.7. Here are some suggestions for 
how the automatic analysis could be improved:

- use a dictionary, or maybe just count a few common English words with at 
least three characters ( https://en.wikipedia.org/wiki/Most_common_words_in_English 
), in order to ignore files that are mostly made of trash (although the trash 
changes)
- delete files from the test set that are known to be corrupt, or that won't 
yield any useful text even in Adobe Reader, so that the manual investigation 
doesn't have to be repeated each time.
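
To make the first suggestion concrete, here is a minimal sketch in Python of 
what such a check might look like. It is not part of Tika or of the existing 
comparison tooling; the word list is a small hand-picked sample, and the 
function names and the 0.05 threshold are assumptions chosen purely for 
illustration.

```python
import re

# Hypothetical, abbreviated sample of common English words with at least
# three letters; a real run would likely use a longer list or a dictionary.
COMMON_WORDS = frozenset({
    "the", "and", "that", "have", "for", "not", "with", "you", "this",
    "but", "from", "they", "say", "her", "she", "will", "one", "all",
    "would", "there", "their", "what",
})

# Only consider alphabetic tokens of three or more characters.
TOKEN_RE = re.compile(r"[A-Za-z]{3,}")


def common_word_ratio(text):
    """Return the fraction of 3+ letter tokens that are common English words."""
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in COMMON_WORDS)
    return hits / len(tokens)


def looks_like_trash(text, min_ratio=0.05):
    """Flag extracted text as probable trash (min_ratio is an assumed cutoff)."""
    return common_word_ratio(text) < min_ratio
```

Files whose extracted text is flagged by both versions could then be left out 
of the comparison report, so that differences in trash output don't show up 
as regressions or improvements. The threshold would need tuning against the 
actual test set, and the check only makes sense for English-language files.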

I analysed only cases where there were no exceptions. Within the next few days, 
I'll investigate some of the cases where there are still exceptions; however, 
most of these are corrupt files that even Adobe Reader doesn't display.

> Upgrade to PDFBox 1.8.7
> -----------------------
>
>                 Key: TIKA-1419
>                 URL: https://issues.apache.org/jira/browse/TIKA-1419
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>         Attachments: compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.csv, 
> compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOT.zip
>
>
> Will run against govdocs1 early next week and then upgrade if no major 
> regressions are found.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
