[
https://issues.apache.org/jira/browse/PDFBOX-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982590#comment-14982590
]
Tim Allison commented on PDFBOX-3058:
-------------------------------------
I've been thinking of something similar (see also TIKA-1443 ;) ).
Two issues: 1) how exactly to define "common", and 2) this metric has to be
multilingual, even though our corpus is depressingly monolingual at this point.
What would happen to bilingual docs that langid tags with only their dominant
lang? Hmmm...
Perhaps we could build a global, multilingual "common words" list from
Wikipedia: terms that appear in (say) > 30% of a given lang's docs. Someone has
surely already done this?
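To make the "> 30% of a given lang's docs" idea concrete, here is a minimal sketch of a per-language document-frequency cutoff. Everything in it is an assumption for illustration: the function names, the (lang, text) input shape, the toy documents standing in for Wikipedia articles, and especially the regex tokenizer, which is naive and probably inadequate for several of the languages in the table below.

```python
import re
from collections import Counter, defaultdict

# Naive tokenizer (an assumption): runs of word characters, lowercased.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return TOKEN_RE.findall(text.lower())


def build_common_words(docs, min_doc_fraction=0.30):
    """Map each language to the terms found in more than
    min_doc_fraction of that language's documents.

    docs is an iterable of (lang, text) pairs.
    """
    doc_counts = Counter()           # lang -> number of docs seen
    doc_freq = defaultdict(Counter)  # lang -> term -> docs containing term
    for lang, text in docs:
        doc_counts[lang] += 1
        # set() so a term counts once per document, not once per occurrence
        doc_freq[lang].update(set(tokenize(text)))
    return {
        lang: {
            term
            for term, df in doc_freq[lang].items()
            if df / doc_counts[lang] > min_doc_fraction
        }
        for lang in doc_counts
    }


# Toy stand-ins for Wikipedia articles (hypothetical data).
docs = [
    ("en", "the cat sat on the mat"),
    ("en", "the dog ate the bone"),
    ("en", "a bird sang in the tree"),
    ("en", "a fish swam"),
    ("de", "der hund und die katze"),
    ("de", "der vogel und das haus"),
    ("de", "ein baum und der fluss"),
    ("de", "das auto"),
]
common = build_common_words(docs)
```

The threshold is on document frequency rather than raw term counts, so one long article that repeats a word many times does not push it onto the list. The bilingual question above is not addressed here: a doc would still be scored against a single language's list.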
Langs as detected by
[lang-detect|https://code.google.com/p/language-detection/] in our ~100k
subcorpus:
||Detected Lang||Number of docs||
|en|92027|
|null|1954|
|es|1167|
|de|757|
|fr|605|
|it|437|
|pt|398|
|bn|218|
|id|146|
|ar|145|
|nl|123|
|pl|99|
|vi|60|
|ru|42|
...
> Support TIKA Migration to PDFBox 2.0
> ------------------------------------
>
> Key: PDFBOX-3058
> URL: https://issues.apache.org/jira/browse/PDFBOX-3058
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 2.0.0
> Reporter: Maruan Sahyoun
> Attachments: NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU_1_8_10.json,
> NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU_2_0.json, content_diffs-1.8-to-2.0.xlsx
>
>
> This issue tracks fixes for problems that came up as part of TIKA-1285
> (Upgrade to PDFBox 2.0.0 when available), mainly:
> - new exceptions compared to PDFBox 1.8.x
> - regressions in text extraction
> - lower quality text extraction
> There should be individual issues to track tasks/bugs arising from that.