[jira] [Commented] (TIKA-4745) Small improvements to lang detection, charset detection and junk detection

ASF GitHub Bot (Jira) Wed, 03 Jun 2026 11:43:08 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085887#comment-18085887
 ]


ASF GitHub Bot commented on TIKA-4745:
--------------------------------------

tballison commented on code in PR #2861:
URL: https://github.com/apache/tika/pull/2861#discussion_r3351020518


##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/CjkDecodeValidator.java:
##########
@@ -0,0 +1,151 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.ml.chardetect;
+
+import java.nio.ByteBuffer;
+import java.nio.CharBuffer;
+import java.nio.charset.Charset;
+import java.nio.charset.CharsetDecoder;
+import java.nio.charset.CoderResult;
+import java.nio.charset.CodingErrorAction;
+import java.util.Locale;
+
+import org.apache.tika.detect.CharsetSupersets;
+
+/**
+ * Structural false-CJK veto: measures how badly a probe fails to decode under a
+ * legacy multi-byte CJK charset, robustly against embedded UTF-8.
+ *
+ * <p>A Latin/Cyrillic/garbage page mis-detected as a legacy CJK charset decodes
+ * with many malformed/unmappable sequences; real CJK decodes cleanly.  Two
+ * corrections make the rate meaningful (see the findings doc):
+ * <ol>
+ *   <li>decode under the <em>vendor superset</em> ({@link CharsetSupersets}) so
+ *       real vendor-extension chars aren't counted as failures;</li>
+ *   <li><strong>discount embedded UTF-8</strong> — mixed-encoding pages (legacy
+ *       CJK body + UTF-8 widgets) would otherwise inflate the rate.  Post-discount,
+ *       real CJK (pure or mixed) is ≤1.6% while genuine false-CJK stays ≥5.3%.</li>
+ * </ol>
+ *
+ * <p>The discount is done by a <em>UTF-8-aware single pass</em>, NOT by physically
+ * stripping UTF-8 runs: a real legacy-CJK char can coincidentally match UTF-8
+ * grammar (e.g. Shift_JIS kanji with lead 0xE0–0xEA), and physically removing it
+ * would misalign the stream and manufacture failures on genuine CJK.  Instead we
+ * walk the bytes, skip positions that begin a valid UTF-8 sequence, and decode the
+ * legacy charset in place everywhere else — so real CJK is never misaligned and
+ * the rate errs toward <em>not</em> vetoing.
+ *
+ * <p>Does NOT catch the legal-but-wrong class (Latin bytes that form <em>valid</em>
+ * CJK at ~0 failure) — that's the typicality layer's job.
+ */
+public final class CjkDecodeValidator {
+
+    private CjkDecodeValidator() {
+    }
+
+    /** Minimum legacy (non-UTF-8) high bytes required before the rate is trusted. */
+    public static final int MIN_HIGH_BYTES = 30;
+
+    /**
+     * Failure rate of {@code bytes} under {@code cjkCharset}'s vendor superset,
+     * counting only legacy high bytes (embedded UTF-8 is skipped, not counted).
+     *
+     * @return failures / legacy-high-bytes, or {@code -1.0} when there is too
+     *         little legacy evidence (legacy high bytes &lt; {@link #MIN_HIGH_BYTES})
+     */
+    public static double strippedFailureRate(byte[] bytes, Charset cjkCharset) {
+        Charset decodeAs = CharsetSupersets.supersetOf(cjkCharset);
+        if (decodeAs == null) {
+            decodeAs = cjkCharset;
+        }
+        CharsetDecoder dec = decodeAs.newDecoder()
+                .onMalformedInput(CodingErrorAction.REPORT)
+                .onUnmappableCharacter(CodingErrorAction.REPORT);
+        CharBuffer one = CharBuffer.allocate(1);
+        int i = 0;
+        int n = bytes.length;
+        int fail = 0;
+        int nHigh = 0;
+        while (i < n) {
+            int x = bytes[i] & 0xFF;
+            if (x < 0x80) {
+                i++;
+                continue;
+            }
+            int ulen = utf8SequenceLength(bytes, i);
+            if (ulen > 0) {
+                i += ulen; // embedded UTF-8 — not legacy content, skip
+                continue;
+            }
+            nHigh++;
+            dec.reset();
+            one.clear();
+            ByteBuffer in = ByteBuffer.wrap(bytes, i, Math.min(4, n - i));
+            CoderResult r = dec.decode(in, one, true);
+            if (r.isError()) {
+                fail++;
+                i++;
+            } else {
+                int consumed = in.position() - i;
+                i += Math.max(1, consumed);
+            }

Review Comment:
   Claude: This one's a false alarm. ByteBuffer.wrap(byte[], offset, length) is the three-arg form: it returns a buffer whose position is offset, not 0 (capacity = array.length, limit = offset + length). Only the single-arg wrap(byte[]) starts at position 0. So after decode, in.position() is i + bytesConsumed, and consumed = in.position() - i is correct.

   Verified directly with bytes = {…,0xC4,0xE3,…}, i = 3: the initial position is 3 (not 0); after decoding the 2-byte GB18030 sequence the position is 5, so in.position() - i = 2, which is correct. Using the raw in.position() as you suggest would yield 5 and over-advance i. Leaving as-is.
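
A minimal, standalone sketch of that check, for anyone who wants to reproduce it. The class name `WrapPositionDemo`, the helper `consumedAt`, and the surrounding ASCII bytes are made up for illustration (the original test bytes are elided above), and it assumes the runtime provides the GB18030 charset:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical demo class; not part of the PR.
class WrapPositionDemo {

    /** Bytes consumed by decoding one char at offset i, or -1 on a decode error. */
    static int consumedAt(byte[] bytes, int i, Charset cs) {
        CharsetDecoder dec = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer one = CharBuffer.allocate(1);
        // Three-arg wrap: the buffer's position starts at i, its limit is i + length.
        ByteBuffer in = ByteBuffer.wrap(bytes, i, Math.min(4, bytes.length - i));
        if (dec.decode(in, one, true).isError()) {
            return -1;
        }
        // Position is absolute within the backing array, so subtract the start offset.
        return in.position() - i;
    }

    public static void main(String[] args) {
        // Illustrative input: three ASCII bytes, one 2-byte GB18030 char, one ASCII byte.
        byte[] bytes = {'a', 'b', 'c', (byte) 0xC4, (byte) 0xE3, 'd'};
        System.out.println(ByteBuffer.wrap(bytes).position());       // single-arg form
        System.out.println(ByteBuffer.wrap(bytes, 3, 3).position()); // three-arg form
        System.out.println(consumedAt(bytes, 3, Charset.forName("GB18030")));
    }
}
```

Under those assumptions the three lines printed should be 0, 3 and 2, matching the positions described in the comment.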
   
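
The quoted Javadoc describes skipping positions that begin a valid UTF-8 sequence, but the helper it calls is outside the hunk shown here. The following is only a sketch of what such a check could look like, assuming strict well-formedness (no overlong forms, no surrogates, nothing above U+10FFFF). `Utf8SkipSketch` and `sequenceLength` are hypothetical names, and the PR's actual helper may differ:

```java
// Hypothetical sketch; not the implementation from the PR.
class Utf8SkipSketch {

    /**
     * Returns the length (2 to 4) of a well-formed multi-byte UTF-8 sequence
     * starting at index i, or 0 if the bytes there are not valid UTF-8.
     */
    static int sequenceLength(byte[] b, int i) {
        int x = b[i] & 0xFF;
        int len;
        int cp;   // code point accumulated from the payload bits
        int min;  // smallest code point legal for this length (rejects overlongs)
        if (x >= 0xC2 && x <= 0xDF) {
            len = 2; cp = x & 0x1F; min = 0x80;
        } else if (x >= 0xE0 && x <= 0xEF) {
            len = 3; cp = x & 0x0F; min = 0x800;
        } else if (x >= 0xF0 && x <= 0xF4) {
            len = 4; cp = x & 0x07; min = 0x10000;
        } else {
            return 0; // ASCII, a stray continuation byte, or an illegal lead byte
        }
        if (i + len > b.length) {
            return 0; // truncated sequence
        }
        for (int k = 1; k < len; k++) {
            int c = b[i + k] & 0xFF;
            if ((c & 0xC0) != 0x80) {
                return 0; // not a continuation byte
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF) {
            return 0; // overlong encoding or beyond the Unicode range
        }
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0; // surrogates are not valid in UTF-8
        }
        return len;
    }
}
```

With a check of this shape, the GB18030 pair 0xC4 0xE3 from the comment above is not mistaken for UTF-8, because 0xE3 is not a continuation byte, so it falls through to the legacy decoder.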





> Small improvements to lang detection, charset detection and junk detection
> --------------------------------------------------------------------------
>
>                 Key: TIKA-4745
>                 URL: https://issues.apache.org/jira/browse/TIKA-4745
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>
> I ran a regression test in prep for the 4.0.0-beta-1 release. There are a 
> number of smallish things that we can clean up in the components listed in 
> the title.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
