[ https://issues.apache.org/jira/browse/TIKA-4745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085887#comment-18085887 ]
ASF GitHub Bot commented on TIKA-4745:
--------------------------------------
tballison commented on code in PR #2861:
URL: https://github.com/apache/tika/pull/2861#discussion_r3351020518
##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/CjkDecodeValidator.java:
##########
@@ -0,0 +1,151 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.ml.chardetect;
+
+import java.nio.ByteBuffer;
+import java.nio.CharBuffer;
+import java.nio.charset.Charset;
+import java.nio.charset.CharsetDecoder;
+import java.nio.charset.CoderResult;
+import java.nio.charset.CodingErrorAction;
+import java.util.Locale;
+
+import org.apache.tika.detect.CharsetSupersets;
+
+/**
+ * Structural false-CJK veto: measures how badly a probe fails to decode under a
+ * legacy multi-byte CJK charset, robustly against embedded UTF-8.
+ *
+ * <p>A Latin/Cyrillic/garbage page mis-detected as a legacy CJK charset decodes
+ * with many malformed/unmappable sequences; real CJK decodes cleanly. Two
+ * corrections make the rate meaningful (see the findings doc):
+ * <ol>
+ * <li>decode under the <em>vendor superset</em> ({@link CharsetSupersets}) so
+ * real vendor-extension chars aren't counted as failures;</li>
+ * <li><strong>discount embedded UTF-8</strong> — mixed-encoding pages (legacy
+ * CJK body + UTF-8 widgets) would otherwise inflate the rate. Post-discount,
+ * real CJK (pure or mixed) is ≤1.6% while genuine false-CJK stays ≥5.3%.</li>
+ * </ol>
+ *
+ * <p>The discount is done by a <em>UTF-8-aware single pass</em>, NOT by physically
+ * stripping UTF-8 runs: a real legacy-CJK char can coincidentally match UTF-8
+ * grammar (e.g. Shift_JIS kanji with lead 0xE0–0xEA), and physically removing it
+ * would misalign the stream and manufacture failures on genuine CJK. Instead we
+ * walk the bytes, skip positions that begin a valid UTF-8 sequence, and decode the
+ * legacy charset in place everywhere else — so real CJK is never misaligned and
+ * the rate errs toward <em>not</em> vetoing.
+ *
+ * <p>Does NOT catch the legal-but-wrong class (Latin bytes that form <em>valid</em>
+ * CJK at ~0 failure) — that's the typicality layer's job.
+ */
+public final class CjkDecodeValidator {
+
+ private CjkDecodeValidator() {
+ }
+
+ /** Minimum legacy (non-UTF-8) high bytes required before the rate is trusted. */
+ public static final int MIN_HIGH_BYTES = 30;
+
+ /**
+ * Failure rate of {@code bytes} under {@code cjkCharset}'s vendor superset,
+ * counting only legacy high bytes (embedded UTF-8 is skipped, not counted).
+ *
+ * @return failures / legacy-high-bytes, or {@code -1.0} when there is too
+ * little legacy evidence (legacy high bytes < {@link #MIN_HIGH_BYTES})
+ */
+ public static double strippedFailureRate(byte[] bytes, Charset cjkCharset) {
+ Charset decodeAs = CharsetSupersets.supersetOf(cjkCharset);
+ if (decodeAs == null) {
+ decodeAs = cjkCharset;
+ }
+ CharsetDecoder dec = decodeAs.newDecoder()
+ .onMalformedInput(CodingErrorAction.REPORT)
+ .onUnmappableCharacter(CodingErrorAction.REPORT);
+ CharBuffer one = CharBuffer.allocate(1);
+ int i = 0;
+ int n = bytes.length;
+ int fail = 0;
+ int nHigh = 0;
+ while (i < n) {
+ int x = bytes[i] & 0xFF;
+ if (x < 0x80) {
+ i++;
+ continue;
+ }
+ int ulen = utf8SequenceLength(bytes, i);
+ if (ulen > 0) {
+ i += ulen; // embedded UTF-8 — not legacy content, skip
+ continue;
+ }
+ nHigh++;
+ dec.reset();
+ one.clear();
+ ByteBuffer in = ByteBuffer.wrap(bytes, i, Math.min(4, n - i));
+ CoderResult r = dec.decode(in, one, true);
+ if (r.isError()) {
+ fail++;
+ i++;
+ } else {
+ int consumed = in.position() - i;
+ i += Math.max(1, consumed);
+ }
Review Comment:
Claude: This one's a false alarm. ByteBuffer.wrap(byte[], offset, length) is
the three-arg form: it returns a buffer whose position is offset, not 0
(capacity = array.length, limit = offset + length). Only the single-arg
wrap(byte[]) starts at position 0. So after decode, in.position() is
i + bytesConsumed, and consumed = in.position() - i is correct.

Verified directly with bytes = {…,0xC4,0xE3,…}, i = 3: initial position is 3
(not 0); after decoding the 2-byte GB18030 sequence position is 5, so
in.position() - i = 2, which is correct. Using the raw in.position() as you
suggest would yield 5 and over-advance i. Leaving as-is.
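
For anyone who wants to reproduce the check, here is a minimal standalone
sketch. It is not part of the PR: the class name WrapPositionDemo, the helper
names, and the ASCII bytes around 0xC4 0xE3 are made up for illustration, and
it assumes the running JDK ships the GB18030 charset.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

// Hypothetical demo class; mirrors the decode step of the loop under review.
class WrapPositionDemo {

    /** Position of a freshly wrapped buffer: equals the offset, not 0. */
    static int initialPosition(byte[] bytes, int i) {
        return ByteBuffer.wrap(bytes, i, bytes.length - i).position();
    }

    /**
     * Decodes at most one char starting at index i and returns the number of
     * bytes consumed, or -1 if the decoder reports an error.
     */
    static int consumedAt(byte[] bytes, int i, Charset cs) {
        CharsetDecoder dec = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer one = CharBuffer.allocate(1);
        ByteBuffer in = ByteBuffer.wrap(bytes, i, Math.min(4, bytes.length - i));
        CoderResult r = dec.decode(in, one, true);
        if (r.isError()) {
            return -1;
        }
        // in.position() is an absolute array index here, so subtract i.
        return in.position() - i;
    }

    public static void main(String[] args) {
        // Made-up probe: three ASCII bytes, one 2-byte GB18030 char, one ASCII byte.
        byte[] bytes = {'a', 'b', 'c', (byte) 0xC4, (byte) 0xE3, 'd'};
        Charset gb18030 = Charset.forName("GB18030"); // assumed to be available
        System.out.println("initial position = " + initialPosition(bytes, 3));
        System.out.println("bytes consumed   = " + consumedAt(bytes, 3, gb18030));
    }
}
```

The one-char output buffer stops the decoder after the first character, so
the position delta covers exactly that character's bytes.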
> Small improvements to lang detection, charset detection and junk detection
> --------------------------------------------------------------------------
>
> Key: TIKA-4745
> URL: https://issues.apache.org/jira/browse/TIKA-4745
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
>
> I ran a regression test in prep for the 4.0.0-beta-1 release. There are a
> number of smallish things that we can clean up in the components listed in
> the title.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)