[
https://issues.apache.org/jira/browse/TIKA-4745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18084362#comment-18084362
]
ASF GitHub Bot commented on TIKA-4745:
--------------------------------------
tballison commented on code in PR #2848:
URL: https://github.com/apache/tika/pull/2848#discussion_r3324539536
##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java:
##########
@@ -454,21 +508,29 @@ public ScoreResult scoreClassesAndCount(byte[] probe) {
}
// logPs are negative; "best" class for the bigram = highest
- // (least negative) contribution after dequant.
+ // (least negative) contribution after dequant. Cap reference
+ // is the best contribution from a class outside top-1's
+ // cohort, so the cap engages on cross-cohort gaps that a
+ // max-vs-overall-runner-up cap missed when multiple classes
+ // in top-1's cohort sat close together.
+ int topClass = -1;
double max = Double.NEGATIVE_INFINITY;
- double secondMax = Double.NEGATIVE_INFINITY;
for (int c = 0; c < numClasses; c++) {
double contrib = logP8[base + c] * countTimesIdf * perClassDequant[c];
contributions[c] = contrib;
if (contrib > max) {
- secondMax = max;
max = contrib;
- } else if (contrib > secondMax) {
- secondMax = contrib;
+ topClass = c;
+ }
+ }
+ Cohort topCohort = cohorts[topClass];
+ double bestCrossCohort = Double.NEGATIVE_INFINITY;
+ for (int c = 0; c < numClasses; c++) {
+ if (cohorts[c] != topCohort && contributions[c] > bestCrossCohort) {
+ bestCrossCohort = contributions[c];
Review Comment:
Added a load-time check that throws if the model has only one cohort. A
classifier with one category would be ... a choice.
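
For readers following along, here is a minimal, self-contained sketch of the two ideas discussed in this thread: the load-time guard and the cross-cohort cap reference. It is not the code from the PR. The class name `CohortCheckSketch`, the `Cohort` enum values, the method names, and the exception type are all assumptions made for illustration; only the loop shape follows the diff above.

```java
import java.util.EnumSet;
import java.util.Set;

public class CohortCheckSketch {

    // Hypothetical cohort labels; the real Cohort type in the PR may differ.
    enum Cohort { LATIN, CJK, CYRILLIC }

    /**
     * Load-time guard (assumed shape): fail fast when every class falls
     * into the same cohort, because no cross-cohort reference would exist.
     * Assumes the array holds no null entries.
     */
    static void requireMultipleCohorts(Cohort[] cohorts) {
        Set<Cohort> distinct = EnumSet.noneOf(Cohort.class);
        for (Cohort c : cohorts) {
            distinct.add(c);
        }
        if (distinct.size() < 2) {
            throw new IllegalArgumentException(
                    "model needs at least two cohorts, found " + distinct.size());
        }
    }

    /**
     * Best (least negative) contribution among classes outside the cohort
     * of the top-scoring class. Assumes both arrays are non-empty and have
     * the same length.
     */
    static double bestCrossCohort(double[] contributions, Cohort[] cohorts) {
        int topClass = 0;
        for (int c = 1; c < contributions.length; c++) {
            if (contributions[c] > contributions[topClass]) {
                topClass = c;
            }
        }
        Cohort topCohort = cohorts[topClass];
        double best = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < contributions.length; c++) {
            if (cohorts[c] != topCohort && contributions[c] > best) {
                best = contributions[c];
            }
        }
        return best;
    }
}
```

With contributions `{-1.0, -1.2, -5.0}` and cohorts `{LATIN, LATIN, CJK}`, the overall runner-up is `-1.2`, but the cross-cohort reference is `-5.0`, which is the gap the new cap is meant to see. Without the guard, a single-cohort model would leave the reference at negative infinity.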
> Small improvements to lang detection, charset detection and junk detection
> --------------------------------------------------------------------------
>
> Key: TIKA-4745
> URL: https://issues.apache.org/jira/browse/TIKA-4745
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
>
> I ran a regression test in prep for the 4.0.0-beta-1 release. There are a
> number of smallish things that we can clean up in the components listed in
> the title.