Tim Allison created TIKA-4690:
---------------------------------
Summary: Add generative language model in 4.x
Key: TIKA-4690
URL: https://issues.apache.org/jira/browse/TIKA-4690
Project: Tika
Issue Type: Task
Reporter: Tim Allison
I finally realized that we can play with the logits from the language
detector all we want, but that is not a great approach for "languagey" vs.
junk detection. On this ticket, we'll add a generative model trained on the
same languages as the language detector so that we can better answer
questions such as: "The language detector said Thai; how likely is this text
to actually be Thai?"
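
To make the idea concrete, here is a minimal sketch of one way a per-language generative score could work. It is an assumption for illustration, not the design proposed on this ticket: the ticket does not say what kind of model will be used, and the class name CharBigramModel and its methods are hypothetical, not Tika APIs. The sketch trains a character bigram model with add-one smoothing and reports the average log-probability per character transition, so text that resembles the training language scores higher than junk.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical illustration only; not a Tika class.
 * A character bigram model with add-one (Laplace) smoothing.
 * Works on UTF-16 chars, which is a simplification for text
 * outside the Basic Multilingual Plane.
 */
class CharBigramModel {
    private final Map<String, Integer> bigramCounts = new HashMap<>();
    private final Map<Character, Integer> contextCounts = new HashMap<>();
    private final Set<Character> vocab = new HashSet<>();

    /** Accumulates bigram counts from a sample of the target language. */
    void train(String text) {
        for (int i = 0; i + 1 < text.length(); i++) {
            char prev = text.charAt(i);
            char next = text.charAt(i + 1);
            bigramCounts.merge("" + prev + next, 1, Integer::sum);
            contextCounts.merge(prev, 1, Integer::sum);
            vocab.add(prev);
            vocab.add(next);
        }
    }

    /**
     * Returns the average natural-log probability per character transition.
     * Values closer to zero mean the text looks more like the training data.
     */
    double avgLogProb(String text) {
        if (text.length() < 2) {
            throw new IllegalArgumentException("need at least two characters");
        }
        // One extra vocabulary slot is reserved for characters never seen in training.
        int v = vocab.size() + 1;
        double sum = 0.0;
        for (int i = 0; i + 1 < text.length(); i++) {
            char prev = text.charAt(i);
            char next = text.charAt(i + 1);
            int pair = bigramCounts.getOrDefault("" + prev + next, 0);
            int ctx = contextCounts.getOrDefault(prev, 0);
            sum += Math.log((pair + 1.0) / (ctx + v));
        }
        return sum / (text.length() - 1);
    }
}
```

Under this sketch, one model would be trained per language, and the text would be scored against the model for whichever language the detector reported. A low score relative to typical text in that language would suggest junk. How to calibrate that threshold is left open here.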
--
This message was sent by Atlassian Jira
(v8.20.10#820010)