[jira] [Commented] (TIKA-4731) Ongoing improvements to the junk detector

ASF GitHub Bot (Jira) Tue, 26 May 2026 21:17:07 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18083699#comment-18083699
 ]


ASF GitHub Bot commented on TIKA-4731:
--------------------------------------

Copilot commented on code in PR #2839:
URL: https://github.com/apache/tika/pull/2839#discussion_r3308398037


##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/tools/BuildJunkTrainingData.java:
##########
@@ -658,6 +658,11 @@ static String filterSentence(String text, int minBytes, 
double maxPuncFrac,
         if (text.indexOf('\uFFFD') >= 0) {
             return null;
         }
+        // NFD (not NFC) so combining-mark scripts (Vietnamese precomposed,
+        // Indic, Thai) have their marks as separate codepoints in the
+        // training corpus.  Lets per-script bigram tables and z5 (letter-
+        // adjacent-to-mark) discriminate uniformly across mark-using
+        // scripts.  Must match JunkDetector.scoreText's normalization.
         text = Normalizer.normalize(text, Normalizer.Form.NFC);

Review Comment:
   The normalization comment says “NFD (not NFC)” but the code normalizes with 
Normalizer.Form.NFC, and JunkDetector.aggregate() also NFC-normalizes. Please 
fix the comment (or the normalization) so training/inference behavior and the 
rationale match; as written it’s internally inconsistent and misleading for 
future changes.
   



##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/AdaptiveProbe.java:
##########
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.ml.chardetect;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.commons.io.IOUtils;
+
+import org.apache.tika.io.TikaInputStream;
+
+/**
+ * Reads an encoding-detection probe sized by <em>content</em>, not raw bytes.
+ *
+ * <p>A fixed raw probe (e.g. 16&nbsp;KB) starves the detectors when a page
+ * leads with a large {@code <head>}/inline script: after tag-stripping there's
+ * little body text, and the bytes that distinguish charsets sit past the
+ * window.  This grows the read until ~{@code contentTarget} bytes of
+ * tag-stripped content are present, capped at {@code rawCap} raw bytes.
+ *
+ * <p>Text-rich pages stop early (~one chunk); markup-heavy pages read deeper,
+ * bounded by the cap.  Multi-byte encodings (UTF-16/32) register no ASCII tags
+ * so they stop at {@code contentTarget} raw bytes.
+ */
+public final class AdaptiveProbe {
+
+    /** Default body-content target. */
+    public static final int DEFAULT_CONTENT_TARGET = 16384;
+    /** Default hard ceiling on raw bytes read. */
+    public static final int DEFAULT_RAW_CAP = 524288;
+
+    private AdaptiveProbe() {
+    }
+
+    /**
+     * Reads from {@code tis} (mark/reset preserved) until tag-stripped content
+     * reaches {@code contentTarget}, the raw read reaches {@code rawCap}, or
+     * EOF — whichever first.  Returns the raw bytes read.
+     */
+    public static byte[] read(TikaInputStream tis, int contentTarget, int 
rawCap)
+            throws IOException {
+        tis.mark(rawCap);
+        try {
+            byte[] buf = new byte[rawCap];
+            byte[] stripDst = new byte[rawCap];
+            int total = 0;
+            while (total < rawCap) {
+                int want = Math.min(rawCap - total, contentTarget);
+                int n = IOUtils.read(tis, buf, total, want);
+                total += n;
+                HtmlByteStripper.Result r =
+                        HtmlByteStripper.stripTags(buf, 0, total, stripDst, 0);
+                int content = r.tagCount > 0 ? r.length : total;
+                if (content >= contentTarget || n < want) {
+                    break; // enough body text, or EOF
+                }
+            }

Review Comment:
   AdaptiveProbe.read adds the return value of IOUtils.read() to `total` 
without handling EOF. IOUtils.read returns -1 at EOF, which makes `total` 
negative and can trigger invalid array copies / stripTags calls (this should be 
hit by the empty-input test). Handle `n < 0` (or `n <= 0`) as EOF and break 
without modifying `total`.
   



##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java:
##########
@@ -211,62 +268,262 @@ public List<EncodingResult> detect(TikaInputStream tis, 
Metadata metadata,
         return detect(readProbe(tis));
     }
 
+    /** ASCII whitespace: TAB, LF, VT, FF, CR, SPACE. */
+    private static boolean isWhitespace(int b) {
+        return b == 0x09 || b == 0x0a || b == 0x0b || b == 0x0c
+                || b == 0x0d || b == 0x20;
+    }
+
     public List<EncodingResult> detect(byte[] probe) {
-        if (probe == null || probe.length < 2) {
+        ScoreResult sr = scoreClassesAndCount(probe);
+        if (sr == null) {
             return Collections.emptyList();
         }
-        int len = Math.min(probe.length, MAX_PROBE_BYTES);
+        return emitCandidates(sr.scores, sr.scoredBigrams);
+    }
+
+    /**
+     * Score result returned by {@link #scoreClassesAndCount(byte[])}.
+     * Exposes the raw per-class score vector together with the number
+     * of bigrams that actually contributed to the dot product (i.e.,
+     * bigrams with non-zero IDF and not skipped by the whitespace-pair
+     * rule) and the total bigrams in the scored region of the probe.
+     * {@code scoredBigrams} is the unit of "evidence available to NB"
+     * — robust to HTML / whitespace noise in the input because those
+     * bigrams have IDF == 0 and don't contribute.
+     */
+    public static final class ScoreResult {
+        public final double[] scores;
+        public final int scoredBigrams;
+        public final int totalBigrams;
+        public ScoreResult(double[] scores, int scoredBigrams, int 
totalBigrams) {
+            this.scores = scores;
+            this.scoredBigrams = scoredBigrams;
+            this.totalBigrams = totalBigrams;
+        }
+    }
+
+    /**
+     * Compute the raw per-class score vector for a probe, without
+     * top-K extraction or softmax.  Returns {@code null} for null /
+     * tiny probes that can't be scored.
+     */
+    public double[] scoreClasses(byte[] probe) {
+        ScoreResult sr = scoreClassesAndCount(probe);
+        return sr == null ? null : sr.scores;
+    }
+
+    /**
+     * Per-bigram contribution to the per-class score, used for
+     * diagnostic tools that want to understand why a probe scores
+     * one class over another.  Returned by
+     * {@link #analyzeBigrams(byte[], int, int)}.
+     */
+    public static final class BigramContrib {
+        public final int bigram;       // (b0 << 8) | b1
+        public final double contribA;  // logP_A * idf in nats
+        public final double contribB;
+        public BigramContrib(int bigram, double a, double b) {
+            this.bigram = bigram;
+            this.contribA = a;
+            this.contribB = b;
+        }
+        public double diff() {
+            return contribA - contribB;
+        }
+    }
 
-        // Integer hot loop — CharSoup-style.  int8 logP × int8 IDF →
-        // int16 product, accumulated into int32 per class.  Overflow
-        // safety: at MAX_PROBE_BYTES=1024, max 1023 bigrams × 127 × 127
-        // ≈ 16.5M per class, well inside int32's 2.1B headroom.
-        int[] dots = new int[numClasses];
+    /**
+     * For each scored bigram in the probe (same skip rules as
+     * {@link #scoreClasses(byte[])}), compute and return its
+     * dequantized contribution to two specified classes' scores.
+     * The list is in probe order, with duplicates allowed (a bigram
+     * that appears N times in the probe yields N entries).
+     */
+    public List<BigramContrib> analyzeBigrams(byte[] probe, int classA, int 
classB) {
+        List<BigramContrib> out = new java.util.ArrayList<>();
+        if (probe == null || probe.length < 2) {
+            return out;
+        }
+        int len = Math.min(probe.length, MAX_PROBE_BYTES);
+        // perClassDequant[c] folds scale[c] × idfScale already, so
+        // contribution(bigram, c) = logP8[..c] * idf8[bigram] * 
perClassDequant[c]
+        double dqA = perClassDequant[classA];
+        double dqB = perClassDequant[classB];
         for (int i = 0; i + 1 < len; i++) {
-            int bigram = ((probe[i] & 0xFF) << 8) | (probe[i + 1] & 0xFF);
-            int w = idf8[bigram];  // non-negative, 0..127
+            int b0 = probe[i] & 0xFF;
+            int b1 = probe[i + 1] & 0xFF;
+            if (isWhitespace(b0) && isWhitespace(b1)) {
+                continue;
+            }
+            int bigram = (b0 << 8) | b1;
+            int w = idf8[bigram];
             if (w == 0) {
-                continue; // bigram appears in every class; no signal
+                continue;
             }
             int base = bigram * numClasses;
-            for (int c = 0; c < numClasses; c++) {
-                dots[c] += logP8[base + c] * w;
+            double contribA = logP8[base + classA] * w * dqA;
+            double contribB = logP8[base + classB] * w * dqB;
+            out.add(new BigramContrib(bigram, contribA, contribB));
+        }
+        return out;
+    }
+
+    /**
+     * Like {@link #scoreClasses(byte[])} but also reports the number
+     * of bigrams that contributed to the dot product vs the total
+     * scored region.  Used by offline calibration to bucket samples
+     * by "evidence available" rather than raw byte length.
+     */
+    public ScoreResult scoreClassesAndCount(byte[] probe) {
+        if (probe == null || probe.length < 2) {
+            return null;
+        }
+        int len = Math.min(probe.length, MAX_PROBE_BYTES);
+
+        // Pass 1: count distinct bigrams.  Whitespace and zero-IDF
+        // bigrams are skipped as in the original hot loop.  short[] is
+        // enough since count fits in 16383 (max possible).  Track the
+        // ids of distinct bigrams in a parallel array so pass 2 doesn't
+        // need to scan the full 65k space.
+        short[] count = new short[BIGRAM_SPACE];
+        int[] distinctBigrams = new int[len];
+        int distinctIdx = 0;
+        int scored = 0;
+        int total = 0;
+        for (int i = 0; i + 1 < len; i++) {
+            int b0 = probe[i] & 0xFF;
+            int b1 = probe[i + 1] & 0xFF;
+            total++;
+            if (isWhitespace(b0) && isWhitespace(b1)) {
+                continue;
+            }
+            int bigram = (b0 << 8) | b1;
+            int w = idf8[bigram];
+            if (w == 0) {
+                continue;
             }
+            scored++;
+            if (count[bigram] == 0) {
+                distinctBigrams[distinctIdx++] = bigram;
+            }
+            count[bigram]++;
+        }

Review Comment:
   scoreClassesAndCount allocates several large arrays on every detect call 
(notably `new short[BIGRAM_SPACE]` = 65,536 entries) plus `int[len]` and 
`double[numClasses]`. Since this runs in production encoding detection, the 
per-call allocations can become a GC hotspot when parsing many documents. 
Consider reusing these buffers via ThreadLocal or instance fields (with careful 
reset) to keep detection allocation-free like the prior hot loop.



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/parser/pkg/PackageParserTest.java:
##########
@@ -33,6 +34,9 @@ public void handleNonUnicodeEntryName() throws Exception {
     }
 
     @Test
+    @Disabled("TIKA-4731: tiny SJIS filenames are not reliably detected after 
removal "
+            + "of the cyclic-repeat hack in ZipParser. Re-enable when 
zip-entry-name "
+            + "detection is fixed (separate from chain rework).")
     public void handleEntryNameWithCharsetShiftJIS() throws Exception {

Review Comment:
   Disabling this test papers over a regression in Shift_JIS ZIP entry-name 
detection (tiny filenames). Tests should generally not be disabled to 
accommodate known breakage; please fix ZipParser’s entry-name charset detection 
(or reintroduce the prior stabilization) so this can remain enabled.



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/ZipParser.java:
##########
@@ -565,25 +557,12 @@ private String detectEntryName(ZipArchiveEntry entry, 
Metadata parentMetadata,
             return new String(entry.getRawName(), config.getEntryEncoding());
         }
 
-        // If charset detection is enabled, try to detect and decode
+        // If charset detection is enabled, try to detect and decode.
+        // Mojibuster handles short inputs natively (zip filenames are often
+        // 9-30 bytes); no byte-extension trick needed.
         if (config.isDetectCharsetsInEntryNames()) {
             byte[] entryName = entry.getRawName();
-            // Extend short entry names before detection: statistical detectors
-            // (e.g. UniversalEncodingDetector, Icu4j) need enough material to
-            // make a confident call. Cyclically repeat the bytes so the
-            // detector still sees the same byte distribution.
-            byte[] extendedEntryName = entryName;
-            if (entryName != null && 0 < entryName.length
-                    && entryName.length < MIN_BYTES_FOR_DETECTING_CHARSET) {
-                int len = entryName.length
-                        * (MIN_BYTES_FOR_DETECTING_CHARSET / entryName.length);
-                extendedEntryName = new byte[len];
-                for (int i = 0; i < len; i++) {
-                    extendedEntryName[i] = entryName[i % entryName.length];
-                }
-            }
-
-            try (TikaInputStream detectStream = 
TikaInputStream.get(extendedEntryName)) {
+            try (TikaInputStream detectStream = 
TikaInputStream.get(entryName)) {
                 List<EncodingResult> encResults =

Review Comment:
   This change removes the byte-extension (“cyclic repeat”) logic for short 
non-Unicode ZIP entry names, but the integration test for tiny Shift_JIS 
filenames is now disabled because detection is unreliable. That indicates a 
functional regression in entry-name decoding. Either restore a short-input 
stabilization strategy here (or implement a Mojibuster-specific short-name 
path) so Shift_JIS filenames decode reliably, rather than shipping with the 
regression.



##########
tika-ml/tika-ml-junkdetect/src/test/java/org/apache/tika/ml/junkdetect/tools/BuildJunkAugmentationDataTest.java:
##########
@@ -0,0 +1,444 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.ml.junkdetect.tools;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertNotNull;
+import static org.junit.jupiter.api.Assertions.assertNull;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+import java.io.BufferedReader;
+import java.io.BufferedWriter;
+import java.io.InputStreamReader;
+import java.io.OutputStreamWriter;
+import java.io.Writer;
+import java.nio.charset.StandardCharsets;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import java.util.Map;
+import java.util.zip.GZIPInputStream;
+import java.util.zip.GZIPOutputStream;
+
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
+
+import org.apache.tika.metadata.Metadata;
+import org.apache.tika.metadata.TikaCoreProperties;
+import org.apache.tika.serialization.JsonMetadataList;
+
+class BuildJunkAugmentationDataTest {
+
+    @Test
+    void chunkSplitsLongLinesAtWhitespace() {
+        StringBuilder sb = new StringBuilder();
+        // 1200-char line, single paragraph.
+        for (int i = 0; i < 200; i++) {
+            sb.append("aaaa bbbb ccc ");
+        }
+        List<String> chunks = 
BuildJunkAugmentationData.chunk(sb.toString().strip());
+        assertTrue(chunks.size() >= 2, "expected multiple chunks, got " + 
chunks.size());
+        for (String c : chunks) {
+            assertTrue(c.length() <= BuildJunkAugmentationData.MAX_CHUNK_CHARS,
+                    "chunk over MAX_CHUNK_CHARS: " + c.length());
+        }
+    }
+
+    @Test
+    void chunkGreedilyConcatenatesShortLines() {
+        // HTML-extracted text typically arrives as many short 
newline-separated
+        // fragments. The chunker should pack them into target-sized chunks
+        // instead of emitting each fragment as its own training sample.
+        String input = "Hello world.\nSecond paragraph here.\n\nThird.";
+        List<String> chunks = BuildJunkAugmentationData.chunk(input);
+        // total length 42 chars, well under TARGET_CHUNK_CHARS — single chunk
+        assertEquals(1, chunks.size());
+        assertEquals("Hello world. Second paragraph here. Third.", 
chunks.get(0));
+    }
+
+    @Test
+    void chunkEmitsBufferThenSlicesLongLine() {
+        // Short header line, then a long paragraph: the header should flush
+        // before the long line is sliced.
+        String longLine = "x".repeat(700);
+        String input = "header line\n" + longLine + "\ntail line";
+        List<String> chunks = BuildJunkAugmentationData.chunk(input);
+        // expected: "header line", then 2 slices of the long x-string, then 
"tail line"
+        assertEquals("header line", chunks.get(0));
+        assertEquals("tail line", chunks.get(chunks.size() - 1));
+        // Long-line slices are bounded by MAX_CHUNK_CHARS.
+        for (int i = 1; i < chunks.size() - 1; i++) {
+            assertTrue(chunks.get(i).length() <= 
BuildJunkAugmentationData.MAX_CHUNK_CHARS);
+        }
+    }
+
+    @Test
+    void dominantScriptIdentifiesLatin() {
+        String text = "The quick brown fox jumps over the lazy dog. Copyright 
© 2026.";
+        BuildJunkAugmentationData.DocScript ds =
+                BuildJunkAugmentationData.dominantScript(text);
+        assertEquals(Character.UnicodeScript.LATIN, ds.script);
+        assertTrue(ds.dominance >= 0.99, "expected near-100% LATIN, got " + 
ds.dominance);
+    }
+
+    @Test
+    void dominantScriptIdentifiesMixedTextAsBelowThreshold() {
+        // ~50% Latin, ~50% Han — should fall below the 80% dominance gate.
+        String text = "Hello world 这是中文测试内容 testing 测试更多 abc def ghi 
中文混合内容更多内容";
+        BuildJunkAugmentationData.DocScript ds =
+                BuildJunkAugmentationData.dominantScript(text);
+        assertTrue(ds.dominance < 
BuildJunkAugmentationData.MIN_DOC_SCRIPT_DOMINANCE,
+                "expected mixed-script to fail dominance gate, got " + 
ds.dominance);
+    }
+
+    @Test
+    void dominantScriptReturnsNullOnEmptyContent() {
+        BuildJunkAugmentationData.DocScript ds =
+                BuildJunkAugmentationData.dominantScript("\t\n   ");
+        assertNull(ds.script);
+        assertEquals(0.0, ds.dominance);
+    }
+
+    @Test
+    void scanBaselineLineCountsReadsTrainFilesOnly(@TempDir Path tmp) throws 
Exception {
+        // baseline: latin.train.gz with 3 lines, cyrillic.train.gz with 2,
+        // plus a dev split that should be ignored by the scan.
+        writeGz(tmp.resolve("latin.train.gz"), List.of("alpha", "beta", 
"gamma"));
+        writeGz(tmp.resolve("cyrillic.train.gz"), List.of("один", "два"));
+        writeGz(tmp.resolve("latin.dev.gz"), List.of("dev1", "dev2", "dev3"));
+
+        Map<String, Long> counts =
+                BuildJunkAugmentationData.scanBaselineLineCounts(tmp);
+        assertEquals(2, counts.size());
+        assertEquals(3L, counts.get("latin"));
+        assertEquals(2L, counts.get("cyrillic"));
+    }
+
+    @Test
+    void rewriteTrainWithAppendPreservesOriginalAndAddsLines(@TempDir Path tmp)
+            throws Exception {
+        Path src = tmp.resolve("src.train.gz");
+        Path dst = tmp.resolve("dst.train.gz");
+        writeGz(src, List.of("one", "two", "three"));
+
+        BuildJunkAugmentationData.rewriteTrainWithAppend(src, dst, 
List.of("FOUR", "FIVE"));
+
+        List<String> lines = readGz(dst);
+        assertEquals(List.of("one", "two", "three", "FOUR", "FIVE"), lines);
+    }
+
+    @Test
+    void endToEndAugmentsLatinAndSkipsBelowGate(@TempDir Path tmp) throws 
Exception {
+        // 

> Ongoing improvements to the junk detector
> -----------------------------------------
>
>                 Key: TIKA-4731
>                 URL: https://issues.apache.org/jira/browse/TIKA-4731
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>
> With [https://github.com/apache/tika/pull/2818,] I think we have a decent 
> shape for the junk detector. 
> There are several areas for improvement, but I think it is ready to go.
> This ticket tracks follow on work, including:
>  * Smaller model
>  * Handling pathological code block changes
>  * Handling candidates with different character counts
>  * Other items to be discovered in our commoncrawl/govdocs1 corpus?
> We have some coverage for the middle two item, but need to address those more 
> directly.
> This work is not a blocker on the 4.0.0-beta-1 release.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (TIKA-4731) Ongoing improvements to the junk detector

Reply via email to