[ 
https://issues.apache.org/jira/browse/PDFBOX-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982590#comment-14982590
 ] 

Tim Allison commented on PDFBOX-3058:
-------------------------------------

I've been thinking of something similar (see also TIKA-1443 ;) ).  

Two issues: 1) how exactly to define "common", and 2) this metric has to be 
multilingual, even though our corpus is depressingly monolingual at this point.

What would happen to bilingual docs that were langid'd as the dominant lang? 
Hmmm... 

Perhaps build a global/multi-lingual "common words" list from Wikipedia - terms 
that appear in (say) > 30% of a given lang's docs? Someone has surely already 
done this?
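
A minimal sketch of that idea, not an existing Tika or PDFBox utility: collect the terms whose document frequency in one language's corpus exceeds a threshold, then score extracted text by the fraction of its tokens found in that list. The names `tokenize`, `common_words`, `common_word_ratio` and `min_doc_fraction` are hypothetical, the 30% default is just the figure floated above, and the regex tokenizer is an assumption that will not segment languages written without spaces between words.

```python
import re
from collections import Counter

# Assumption: a simple Unicode word-character tokenizer. It does not
# segment languages that are written without whitespace between words.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def common_words(docs, min_doc_fraction=0.30):
    """Return terms that appear in more than min_doc_fraction of docs.

    docs is an iterable of strings, all assumed to be in one language.
    """
    doc_freq = Counter()
    n_docs = 0
    for text in docs:
        n_docs += 1
        # Count each term once per document (document frequency).
        doc_freq.update(set(tokenize(text)))
    if n_docs == 0:
        return set()
    return {t for t, df in doc_freq.items() if df / n_docs > min_doc_fraction}


def common_word_ratio(text, common):
    """Fraction of tokens in text that are in the common-words set."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in common) / len(tokens)
```

For a bilingual doc, one option would be to compute `common_word_ratio` against the list for each candidate language and compare the scores, rather than relying only on the single language the detector reports.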

Langs as detected by 
[lang-detect|https://code.google.com/p/language-detection/] in our ~100k 
subcorpus:

||Detected Lang||Number of docs||
|en|92027|
|null|1954|
|es|1167|
|de|757|
|fr|605|
|it|437|
|pt|398|
|bn|218|
|id|146|
|ar|145|
|nl|123|
|pl|99|
|vi|60|
|ru|42|
...

> Support TIKA Migration to PDFBox 2.0
> ------------------------------------
>
>                 Key: PDFBOX-3058
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3058
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Maruan Sahyoun
>         Attachments: NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU_1_8_10.json, 
> NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU_2_0.json, content_diffs-1.8-to-2.0.xlsx
>
>
> This issue is to track fixing issues which came up as part of TIKA-1285 
> (Upgrade to PDFBox 2.0.0 when available) mainly
> - new exceptions compared to PDFBox 1.8.x
> - regressions in text extraction
> - lower quality text extraction
> There should be individual issues to track tasks/bugs arising from that.



