Tim Allison created TIKA-4686:
---------------------------------
Summary: Improve within language likelihood scores in 4.x
Key: TIKA-4686
URL: https://issues.apache.org/jira/browse/TIKA-4686
Project: Tika
Issue Type: Task
Reporter: Tim Allison
In the charsoupencodingdetector, we're using raw logits to decide whether one
candidate decoding is better than another. This doesn't work well for languages
with distinctive scripts. The intuition is that the model is very confident
that the text is language x (e.g. Chinese) rather than English, but we're not
measuring, given that it is Chinese, how "Chinese-y" it is.
That within-language measure is a different score from the cross-language
logit, but we might be able to compute it with our current weights.
Let's figure out how to do this on this ticket.
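
To make the distinction concrete, here is a rough sketch in Python. It is not
the detector's code and not a proposed solution. The logit values, the feature
counts, the reference mean and standard deviation, and the helper names
(softmax, within_language_z) are all made up for illustration. The z-score
against reference statistics from known-good text is only one possible way to
get a within-language score, assumed here for the example.

```python
import math


def softmax(logits):
    """Cross-language confidence: how strongly one language beats the others."""
    m = max(logits.values())
    exps = {lang: math.exp(v - m) for lang, v in logits.items()}
    total = sum(exps.values())
    return {lang: v / total for lang, v in exps.items()}


def within_language_z(logit, n_features, ref_mean, ref_std):
    """Hypothetical within-language score.

    Normalizes the winning logit by the number of features, then compares it
    with reference statistics that would have to be collected from known-good
    text in that language (assumed to exist; they are invented below).
    """
    per_feature = logit / n_features
    return (per_feature - ref_mean) / ref_std


# Invented logits for two candidate decodings of the same bytes.
clean = {"zho": 42.0, "eng": -15.0}    # correct charset, fluent Chinese
garbled = {"zho": 18.0, "eng": -20.0}  # wrong charset, still CJK-looking

# Both candidates look like "certainly Chinese, not English".
clean_conf = softmax(clean)["zho"]
garbled_conf = softmax(garbled)["zho"]

# Invented reference statistics for per-feature Chinese logits.
REF_MEAN, REF_STD = 2.0, 0.25
N_FEATURES = 20

clean_z = within_language_z(clean["zho"], N_FEATURES, REF_MEAN, REF_STD)
garbled_z = within_language_z(garbled["zho"], N_FEATURES, REF_MEAN, REF_STD)

print(f"cross-language confidence: clean={clean_conf:.6f} garbled={garbled_conf:.6f}")
print(f"within-language z-score:   clean={clean_z:.2f} garbled={garbled_z:.2f}")
```

With these made-up numbers, the cross-language confidence is essentially 1.0
for both candidates, so it cannot separate them, while the within-language
score puts the garbled decoding several standard deviations below typical
Chinese text.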
--
This message was sent by Atlassian Jira
(v8.20.10#820010)