tballison commented on code in PR #2848:
URL: https://github.com/apache/tika/pull/2848#discussion_r3324539536
##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java:
##########
@@ -454,21 +508,29 @@ public ScoreResult scoreClassesAndCount(byte[] probe) {
}
// logPs are negative; "best" class for the bigram = highest
- // (least negative) contribution after dequant.
+ // (least negative) contribution after dequant. Cap reference
+ // is the best contribution from a class outside top-1's
+ // cohort, so the cap engages on cross-cohort gaps that a
+ // max-vs-overall-runner-up cap missed when multiple classes
+ // in top-1's cohort sat close together.
+ int topClass = -1;
double max = Double.NEGATIVE_INFINITY;
- double secondMax = Double.NEGATIVE_INFINITY;
for (int c = 0; c < numClasses; c++) {
double contrib = logP8[base + c] * countTimesIdf * perClassDequant[c];
contributions[c] = contrib;
if (contrib > max) {
- secondMax = max;
max = contrib;
- } else if (contrib > secondMax) {
- secondMax = contrib;
+ topClass = c;
+ }
+ }
+ Cohort topCohort = cohorts[topClass];
+ double bestCrossCohort = Double.NEGATIVE_INFINITY;
+ for (int c = 0; c < numClasses; c++) {
+ if (cohorts[c] != topCohort && contributions[c] > bestCrossCohort) {
+ bestCrossCohort = contributions[c];
Review Comment:
Added a load-time check that throws if there's only one cohort. A classifier
with one category would be ... a choice.
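
For anyone following along, here is a rough sketch of what such a load check could look like and why it matters. It is not the code in the PR: the `Cohort` enum values, the class name `CohortLoadCheck`, and the method names are all hypothetical, and the exception type is an assumption. The second method mirrors the loop in the hunk above to show that with a single cohort nothing ever satisfies `cohorts[c] != topCohort`, so `bestCrossCohort` would stay at `Double.NEGATIVE_INFINITY`.

```java
import java.util.EnumSet;
import java.util.Set;

public class CohortLoadCheck {

    // Hypothetical cohort labels; the real Cohort type in the PR may differ.
    enum Cohort { LATIN, CYRILLIC, CJK }

    // Assumed load-time guard: reject a model whose classes all share one cohort.
    static void requireMultipleCohorts(Cohort[] cohorts) {
        Set<Cohort> distinct = EnumSet.noneOf(Cohort.class);
        for (Cohort c : cohorts) {
            distinct.add(c);
        }
        if (distinct.size() < 2) {
            throw new IllegalStateException(
                    "Model must span at least two cohorts, found " + distinct.size());
        }
    }

    // Same shape as the loop in the diff: best contribution outside top-1's cohort.
    static double bestCrossCohort(Cohort[] cohorts, double[] contributions) {
        int topClass = 0;
        for (int c = 1; c < contributions.length; c++) {
            if (contributions[c] > contributions[topClass]) {
                topClass = c;
            }
        }
        Cohort topCohort = cohorts[topClass];
        double best = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < contributions.length; c++) {
            if (cohorts[c] != topCohort && contributions[c] > best) {
                best = contributions[c];
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Cohort[] mixed = {Cohort.LATIN, Cohort.LATIN, Cohort.CYRILLIC};
        double[] contribs = {-1.0, -1.5, -4.0};
        requireMultipleCohorts(mixed); // passes: two distinct cohorts
        System.out.println(bestCrossCohort(mixed, contribs));

        Cohort[] single = {Cohort.LATIN, Cohort.LATIN};
        try {
            requireMultipleCohorts(single);
        } catch (IllegalStateException e) {
            System.out.println("rejected at load: " + e.getMessage());
        }
    }
}
```

In the mixed example the overall runner-up (-1.5) is in the same cohort as top-1, so the cross-cohort reference is the -4.0 contribution instead; that is the gap the reworked cap is meant to see.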