[ 
https://issues.apache.org/jira/browse/PDFBOX-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981928#comment-14981928
 ] 

Tilman Hausherr commented on PDFBOX-3058:
-----------------------------------------

I have a wish for the next run: a count of the total (non-unique) common English 
words that are 4 characters or longer. I think this could be a better quality 
indicator than the simple count of tokens. We can also use it to answer the 
ultimate question about 2.0 text extraction, whether it extracts more useful 
words than the 1.8 version, by taking sums over all files.

Consider IHRDZVNWB5VTNAMY2M7VQUVMBN233P6J:

old (5426 tokens):
uif: 384 | pg: 362 | up: 170 | jo: 116 | boe: 101 | jt: 93 | b: 91 | uibu: 86 | 
uifpsz: 55 | uijt: 54

new (120 tokens):
the: 10 | of: 8 | uva: 4 | or: 4 | http: 3 | to: 3 | university: 2 | s: 2 | 
and: 2 | dare: 2

From the token counts alone, it would look as if "old" is better, even though 
its most frequent tokens are not English words.
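
A rough sketch of the count I have in mind, applied to the two top-10 lists 
above. This is only an illustration: COMMON_WORDS is a tiny made-up stand-in 
for a real list of common English words, count_common_words is a hypothetical 
helper name, and the top-10 lists are just a sample of each extraction, not 
the full token sets.

```python
from collections import Counter

# Hypothetical stand-in for a real dictionary of common English words.
COMMON_WORDS = {"that", "this", "theory", "university", "dare", "with", "from"}


def count_common_words(token_counts, common_words=COMMON_WORDS, min_len=4):
    """Sum the occurrences (non-unique) of tokens that are common words
    with at least min_len characters."""
    return sum(
        n
        for token, n in token_counts.items()
        if len(token) >= min_len and token.lower() in common_words
    )


# Top-10 token counts quoted above (a sample only, not the full extraction).
old_top = Counter({"uif": 384, "pg": 362, "up": 170, "jo": 116, "boe": 101,
                   "jt": 93, "b": 91, "uibu": 86, "uifpsz": 55, "uijt": 54})
new_top = Counter({"the": 10, "of": 8, "uva": 4, "or": 4, "http": 3,
                   "to": 3, "university": 2, "s": 2, "and": 2, "dare": 2})

old_score = count_common_words(old_top)
new_score = count_common_words(new_top)
print(old_score, new_score)  # prints "0 4" with the toy word list above
```

With this toy word list, "old" scores 0 and "new" scores 4, which is the 
opposite ranking from the raw token counts. Summing such scores over all files 
would give the overall 1.8 vs. 2.0 comparison.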

> Support TIKA Migration to PDFBox 2.0
> ------------------------------------
>
>                 Key: PDFBOX-3058
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3058
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Maruan Sahyoun
>         Attachments: NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU_1_8_10.json, 
> NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU_2_0.json, content_diffs-1.8-to-2.0.xlsx
>
>
> This issue is to track fixing issues which came up as part of TIKA-1285 
> (Upgrade to PDFBox 2.0.0 when available) mainly
> - new exceptions compared to PDFBox 1.8.x
> - regressions in text extraction
> - lower quality text extraction
> There should be individual issues to track tasks/bugs arising from that.



