Copilot commented on code in PR #2886:
URL: https://github.com/apache/tika/pull/2886#discussion_r3380656656
##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/CjkDecodeValidator.java:
##########
@@ -102,6 +114,11 @@ public static double strippedFailureRate(byte[] bytes,
Charset cjkCharset) {
}
}
if (nHigh < MIN_HIGH_BYTES) {
+ // Pure UTF-8: no legacy high bytes at all but enough UTF-8
sequences
+ // to be confident. Return 1.0 so the CJK veto fires.
+ if (nHigh == 0 && nUtf8Seqs >= MIN_HIGH_BYTES) {
+ return 1.0;
+ }
return -1.0;
}
Review Comment:
The new “pure UTF-8” special case relies on utf8SequenceLength(), which only
checks lead/continuation byte ranges and can treat structurally-invalid UTF-8
(e.g., surrogate-range sequences) as valid. That can incorrectly trigger the
1.0 veto on inputs that aren’t actually valid UTF-8. Consider additionally
verifying the probe with StructuralEncodingRules.checkUtf8(...) before
returning 1.0.
##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/MojibusterEncodingDetector.java:
##########
@@ -427,17 +427,30 @@ public List<EncodingResult> detect(byte[] probe, Metadata
metadata) {
LOG.trace("mojibuster pool empty -> windows-1252 fallback");
return windows1252Fallback();
}
+ // When the top result is STRUCTURAL (clean UTF-8/UTF-32/ISO-2022
grammar),
+ // return only that one result. JunkFilter must not re-open
Mojibuster's
+ // internal ordering and pick a lower-ranked STATISTICAL CJK candidate
+ // over the STRUCTURAL winner on non-languagey content — that was the
11k
+ // regression root cause. With a single STRUCTURAL result, JunkFilter
+ // still arbitrates when *another* detector disagrees (lying HTML
headers),
+ // which is the intended use case.
+ //
+ // When the top result is STATISTICAL, keep the full ranked list so
that
+ // JunkFilter can arbitrate within-family ambiguities (e.g. GB18030 vs
+ // x-windows-949: NB scores Chinese higher than Korean on JS-heavy
files
+ // because ASCII bigram distributions differ between training corpora,
but
+ // JunkFilter's language-quality scoring correctly prefers Korean
text).
+ EncodingResult top = finalResults.get(0);
+ List<EncodingResult> toReturn = (top.getResultType() ==
EncodingResult.ResultType.STRUCTURAL)
+ ? List.of(top) : finalResults;
Review Comment:
This changes observable behavior by returning only the top STRUCTURAL
candidate (dropping Mojibuster’s lower-ranked STATISTICAL candidates). Please
add a regression test that asserts this contract (e.g., a pure UTF-8 probe
should yield a single UTF-8 STRUCTURAL result) so the JunkFilter interaction
that motivated this change stays covered.
##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java:
##########
@@ -379,6 +391,13 @@ private static boolean isWhitespace(int b) {
|| b == 0x0d || b == 0x20;
}
+ // BETA-1 WORKAROUND: bigrams containing these HTML/JS markup chars are
+ // over-represented in GB18030 training data and cause misclassification.
+ // Suppressed only for GB18030 in scoreClassesAndCount.
+ static boolean isOffendingAscii(int b) {
+ return b == '{' || b == '"' || b == '&' || b == '<' || b == '>';
+ }
Review Comment:
isOffendingAscii() is only used inside this class; making it private avoids
expanding the class’ surface area unnecessarily.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]