Tim Allison created TIKA-3162:
---------------------------------
Summary: Figure out cause of "nep" detection in tika-eval's lang detector
Key: TIKA-3162
URL: https://issues.apache.org/jira/browse/TIKA-3162
Project: Tika
Issue Type: Bug
Reporter: Tim Allison
In the recent regression runs for PDFBox[1], and on some local docs I've been
working with, it looks like we're getting "nep" for documents that
actually contain quite a few English words.
Let's figure out what's going wrong.
[1] https://corpora.tika.apache.org/base/reports/pdfbox-2_0_21-20200810-reports.tgz
--
This message was sent by Atlassian Jira
(v8.3.4#803005)