Tim Allison created TIKA-3162:
---------------------------------

             Summary: Figure out cause of "nep" detection in tika-eval's lang detector
                 Key: TIKA-3162
                 URL: https://issues.apache.org/jira/browse/TIKA-3162
             Project: Tika
          Issue Type: Bug
            Reporter: Tim Allison


In the recent regression runs for PDFBox[1] and on some local docs I've been 
working with, it looks like we're getting "nep" (Nepali) for documents that 
actually have quite a few English words.

Let's figure out what's going wrong.

[1] https://corpora.tika.apache.org/base/reports/pdfbox-2_0_21-20200810-reports.tgz



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
