Re: [PR] TIKA-4745 - small twiddle on charset detection [tika]

via GitHub Tue, 09 Jun 2026 05:42:35 -0700


Copilot commented on code in PR #2886:
URL: https://github.com/apache/tika/pull/2886#discussion_r3380656656



##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/CjkDecodeValidator.java:
##########
@@ -102,6 +114,11 @@ public static double strippedFailureRate(byte[] bytes, 
Charset cjkCharset) {
             }
         }
         if (nHigh < MIN_HIGH_BYTES) {
+            // Pure UTF-8: no legacy high bytes at all but enough UTF-8 
sequences
+            // to be confident.  Return 1.0 so the CJK veto fires.
+            if (nHigh == 0 && nUtf8Seqs >= MIN_HIGH_BYTES) {
+                return 1.0;
+            }
             return -1.0;
         }

Review Comment:
   The new “pure UTF-8” special case relies on utf8SequenceLength(), which only 
checks lead/continuation byte ranges and can treat structurally-invalid UTF-8 
(e.g., surrogate-range sequences) as valid. That can incorrectly trigger the 
1.0 veto on inputs that aren’t actually valid UTF-8. Consider additionally 
verifying the probe with StructuralEncodingRules.checkUtf8(...) before 
returning 1.0.



##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/MojibusterEncodingDetector.java:
##########
@@ -427,17 +427,30 @@ public List<EncodingResult> detect(byte[] probe, Metadata 
metadata) {
             LOG.trace("mojibuster pool empty -> windows-1252 fallback");
             return windows1252Fallback();
         }
+        // When the top result is STRUCTURAL (clean UTF-8/UTF-32/ISO-2022 
grammar),
+        // return only that one result.  JunkFilter must not re-open 
Mojibuster's
+        // internal ordering and pick a lower-ranked STATISTICAL CJK candidate
+        // over the STRUCTURAL winner on non-languagey content — that was the 
11k
+        // regression root cause.  With a single STRUCTURAL result, JunkFilter
+        // still arbitrates when *another* detector disagrees (lying HTML 
headers),
+        // which is the intended use case.
+        //
+        // When the top result is STATISTICAL, keep the full ranked list so 
that
+        // JunkFilter can arbitrate within-family ambiguities (e.g. GB18030 vs
+        // x-windows-949: NB scores Chinese higher than Korean on JS-heavy 
files
+        // because ASCII bigram distributions differ between training corpora, 
but
+        // JunkFilter's language-quality scoring correctly prefers Korean 
text).
+        EncodingResult top = finalResults.get(0);
+        List<EncodingResult> toReturn = (top.getResultType() == 
EncodingResult.ResultType.STRUCTURAL)
+                ? List.of(top) : finalResults;

Review Comment:
   This changes observable behavior by returning only the top STRUCTURAL 
candidate (dropping Mojibuster’s lower-ranked STATISTICAL candidates). Please 
add a regression test that asserts this contract (e.g., a pure UTF-8 probe 
should yield a single UTF-8 STRUCTURAL result) so the JunkFilter interaction 
that motivated this change stays covered.



##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java:
##########
@@ -379,6 +391,13 @@ private static boolean isWhitespace(int b) {
                 || b == 0x0d || b == 0x20;
     }
 
+    // BETA-1 WORKAROUND: bigrams containing these HTML/JS markup chars are
+    // over-represented in GB18030 training data and cause misclassification.
+    // Suppressed only for GB18030 in scoreClassesAndCount.
+    static boolean isOffendingAscii(int b) {
+        return b == '{' || b == '"' || b == '&' || b == '<' || b == '>';
+    }

Review Comment:
   isOffendingAscii() is only used inside this class; making it private avoids 
expanding the class’ surface area unnecessarily.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] TIKA-4745 - small twiddle on charset detection [tika]

Reply via email to