Re: [PR] TIKA-4731 - improve charset detection and junk detection [tika]

via GitHub Tue, 26 May 2026 21:16:26 -0700


Copilot commented on code in PR #2839:
URL: https://github.com/apache/tika/pull/2839#discussion_r3308398037



##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/tools/BuildJunkTrainingData.java:
##########
@@ -658,6 +658,11 @@ static String filterSentence(String text, int minBytes, double maxPuncFrac,
         if (text.indexOf('\uFFFD') >= 0) {
             return null;
         }
+        // NFD (not NFC) so combining-mark scripts (Vietnamese precomposed,
+        // Indic, Thai) have their marks as separate codepoints in the
+        // training corpus.  Lets per-script bigram tables and z5 (letter-
+        // adjacent-to-mark) discriminate uniformly across mark-using
+        // scripts.  Must match JunkDetector.scoreText's normalization.
         text = Normalizer.normalize(text, Normalizer.Form.NFC);

Review Comment:
   The normalization comment says “NFD (not NFC)” but the code normalizes with 
Normalizer.Form.NFC, and JunkDetector.aggregate() also NFC-normalizes. Please 
fix the comment (or the normalization) so training/inference behavior and the 
rationale match; as written it’s internally inconsistent and misleading for 
future changes.
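
   To make the NFC/NFD mismatch concrete, here is a minimal sketch using only 
java.text.Normalizer. The class and method names are hypothetical and not part 
of the PR; it only shows that NFD exposes combining marks as separate code 
points while NFC keeps the precomposed letter.

```java
import java.text.Normalizer;

final class NormalizationFormDemo {

    /** Counts Unicode code points after normalizing to the given form. */
    static int codePointsAfter(String text, Normalizer.Form form) {
        String normalized = Normalizer.normalize(text, form);
        return normalized.codePointCount(0, normalized.length());
    }

    /** True if the normalized text contains a non-spacing combining mark (Mn). */
    static boolean hasCombiningMark(String text, Normalizer.Form form) {
        return Normalizer.normalize(text, form).codePoints()
                .anyMatch(cp -> Character.getType(cp) == Character.NON_SPACING_MARK);
    }

    public static void main(String[] args) {
        // U+1EC7: Vietnamese precomposed e with circumflex and dot below.
        String precomposed = "\u1EC7";
        System.out.println("NFC code points: "
                + codePointsAfter(precomposed, Normalizer.Form.NFC));
        System.out.println("NFD code points: "
                + codePointsAfter(precomposed, Normalizer.Form.NFD));
    }
}
```

   A feature such as "letter adjacent to mark" can only fire on the NFD form 
of this input, so whichever form is chosen has to be the same in training and 
at inference.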
   



##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/AdaptiveProbe.java:
##########
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.ml.chardetect;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.commons.io.IOUtils;
+
+import org.apache.tika.io.TikaInputStream;
+
+/**
+ * Reads an encoding-detection probe sized by <em>content</em>, not raw bytes.
+ *
+ * <p>A fixed raw probe (e.g. 16&nbsp;KB) starves the detectors when a page
+ * leads with a large {@code <head>}/inline script: after tag-stripping there's
+ * little body text, and the bytes that distinguish charsets sit past the
+ * window.  This grows the read until ~{@code contentTarget} bytes of
+ * tag-stripped content are present, capped at {@code rawCap} raw bytes.
+ *
+ * <p>Text-rich pages stop early (~one chunk); markup-heavy pages read deeper,
+ * bounded by the cap.  Multi-byte encodings (UTF-16/32) register no ASCII tags
+ * so they stop at {@code contentTarget} raw bytes.
+ */
+public final class AdaptiveProbe {
+
+    /** Default body-content target. */
+    public static final int DEFAULT_CONTENT_TARGET = 16384;
+    /** Default hard ceiling on raw bytes read. */
+    public static final int DEFAULT_RAW_CAP = 524288;
+
+    private AdaptiveProbe() {
+    }
+
+    /**
+     * Reads from {@code tis} (mark/reset preserved) until tag-stripped content
+     * reaches {@code contentTarget}, the raw read reaches {@code rawCap}, or
+     * EOF — whichever first.  Returns the raw bytes read.
+     */
+    public static byte[] read(TikaInputStream tis, int contentTarget, int rawCap)
+            throws IOException {
+        tis.mark(rawCap);
+        try {
+            byte[] buf = new byte[rawCap];
+            byte[] stripDst = new byte[rawCap];
+            int total = 0;
+            while (total < rawCap) {
+                int want = Math.min(rawCap - total, contentTarget);
+                int n = IOUtils.read(tis, buf, total, want);
+                total += n;
+                HtmlByteStripper.Result r =
+                        HtmlByteStripper.stripTags(buf, 0, total, stripDst, 0);
+                int content = r.tagCount > 0 ? r.length : total;
+                if (content >= contentTarget || n < want) {
+                    break; // enough body text, or EOF
+                }
+            }

Review Comment:
   AdaptiveProbe.read adds the return value of IOUtils.read() to `total` 
without an explicit EOF check. If that call can return -1 at EOF, as 
InputStream.read does, `total` goes negative and can trigger invalid array 
copies / stripTags calls (the empty-input test should hit this). Commons IO 
documents IOUtils.read as returning the number of bytes actually read, so 
confirm the EOF behavior first; if -1 is possible, handle `n < 0` (or 
`n <= 0`) as EOF and break without modifying `total`.
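
   As an illustration of the EOF handling discussed here, the following is a 
minimal sketch of a bounded read loop over a plain java.io.InputStream, whose 
read(byte[], int, int) does return -1 at EOF. The class and method names are 
hypothetical; mark/reset and tag stripping are deliberately left out, and this 
is not the PR's AdaptiveProbe.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class BoundedReadSketch {

    /** Reads up to cap bytes in chunk-sized steps, stopping cleanly at EOF. */
    static byte[] readUpTo(InputStream in, int chunk, int cap) throws IOException {
        if (chunk <= 0 || cap < 0) {
            throw new IllegalArgumentException("chunk must be > 0 and cap >= 0");
        }
        byte[] buf = new byte[cap];
        int total = 0;
        while (total < cap) {
            int want = Math.min(cap - total, chunk);
            int n = in.read(buf, total, want);
            if (n < 0) {
                break; // EOF: leave total untouched so it never goes negative
            }
            total += n;
        }
        return Arrays.copyOf(buf, total);
    }

    /** Convenience wrapper over an in-memory source, for demonstration only. */
    static byte[] readUpTo(byte[] source, int chunk, int cap) {
        try {
            return readUpTo(new ByteArrayInputStream(source), chunk, cap);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

   With an empty source the loop exits on the first read and returns a 
zero-length array, which is the case the empty-input test is meant to cover.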
   



##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java:
##########
@@ -211,62 +268,262 @@ public List<EncodingResult> detect(TikaInputStream tis, Metadata metadata,
         return detect(readProbe(tis));
     }
 
+    /** ASCII whitespace: TAB, LF, VT, FF, CR, SPACE. */
+    private static boolean isWhitespace(int b) {
+        return b == 0x09 || b == 0x0a || b == 0x0b || b == 0x0c
+                || b == 0x0d || b == 0x20;
+    }
+
     public List<EncodingResult> detect(byte[] probe) {
-        if (probe == null || probe.length < 2) {
+        ScoreResult sr = scoreClassesAndCount(probe);
+        if (sr == null) {
             return Collections.emptyList();
         }
-        int len = Math.min(probe.length, MAX_PROBE_BYTES);
+        return emitCandidates(sr.scores, sr.scoredBigrams);
+    }
+
+    /**
+     * Score result returned by {@link #scoreClassesAndCount(byte[])}.
+     * Exposes the raw per-class score vector together with the number
+     * of bigrams that actually contributed to the dot product (i.e.,
+     * bigrams with non-zero IDF and not skipped by the whitespace-pair
+     * rule) and the total bigrams in the scored region of the probe.
+     * {@code scoredBigrams} is the unit of "evidence available to NB"
+     * — robust to HTML / whitespace noise in the input because those
+     * bigrams have IDF == 0 and don't contribute.
+     */
+    public static final class ScoreResult {
+        public final double[] scores;
+        public final int scoredBigrams;
+        public final int totalBigrams;
+        public ScoreResult(double[] scores, int scoredBigrams, int totalBigrams) {
+            this.scores = scores;
+            this.scoredBigrams = scoredBigrams;
+            this.totalBigrams = totalBigrams;
+        }
+    }
+
+    /**
+     * Compute the raw per-class score vector for a probe, without
+     * top-K extraction or softmax.  Returns {@code null} for null /
+     * tiny probes that can't be scored.
+     */
+    public double[] scoreClasses(byte[] probe) {
+        ScoreResult sr = scoreClassesAndCount(probe);
+        return sr == null ? null : sr.scores;
+    }
+
+    /**
+     * Per-bigram contribution to the per-class score, used for
+     * diagnostic tools that want to understand why a probe scores
+     * one class over another.  Returned by
+     * {@link #analyzeBigrams(byte[], int, int)}.
+     */
+    public static final class BigramContrib {
+        public final int bigram;       // (b0 << 8) | b1
+        public final double contribA;  // logP_A * idf in nats
+        public final double contribB;
+        public BigramContrib(int bigram, double a, double b) {
+            this.bigram = bigram;
+            this.contribA = a;
+            this.contribB = b;
+        }
+        public double diff() {
+            return contribA - contribB;
+        }
+    }
 
-        // Integer hot loop — CharSoup-style.  int8 logP × int8 IDF →
-        // int16 product, accumulated into int32 per class.  Overflow
-        // safety: at MAX_PROBE_BYTES=1024, max 1023 bigrams × 127 × 127
-        // ≈ 16.5M per class, well inside int32's 2.1B headroom.
-        int[] dots = new int[numClasses];
+    /**
+     * For each scored bigram in the probe (same skip rules as
+     * {@link #scoreClasses(byte[])}), compute and return its
+     * dequantized contribution to two specified classes' scores.
+     * The list is in probe order, with duplicates allowed (a bigram
+     * that appears N times in the probe yields N entries).
+     */
+    public List<BigramContrib> analyzeBigrams(byte[] probe, int classA, int classB) {
+        List<BigramContrib> out = new java.util.ArrayList<>();
+        if (probe == null || probe.length < 2) {
+            return out;
+        }
+        int len = Math.min(probe.length, MAX_PROBE_BYTES);
+        // perClassDequant[c] folds scale[c] × idfScale already, so
+        // contribution(bigram, c) = logP8[..c] * idf8[bigram] * perClassDequant[c]
+        double dqA = perClassDequant[classA];
+        double dqB = perClassDequant[classB];
         for (int i = 0; i + 1 < len; i++) {
-            int bigram = ((probe[i] & 0xFF) << 8) | (probe[i + 1] & 0xFF);
-            int w = idf8[bigram];  // non-negative, 0..127
+            int b0 = probe[i] & 0xFF;
+            int b1 = probe[i + 1] & 0xFF;
+            if (isWhitespace(b0) && isWhitespace(b1)) {
+                continue;
+            }
+            int bigram = (b0 << 8) | b1;
+            int w = idf8[bigram];
             if (w == 0) {
-                continue; // bigram appears in every class; no signal
+                continue;
             }
             int base = bigram * numClasses;
-            for (int c = 0; c < numClasses; c++) {
-                dots[c] += logP8[base + c] * w;
+            double contribA = logP8[base + classA] * w * dqA;
+            double contribB = logP8[base + classB] * w * dqB;
+            out.add(new BigramContrib(bigram, contribA, contribB));
+        }
+        return out;
+    }
+
+    /**
+     * Like {@link #scoreClasses(byte[])} but also reports the number
+     * of bigrams that contributed to the dot product vs the total
+     * scored region.  Used by offline calibration to bucket samples
+     * by "evidence available" rather than raw byte length.
+     */
+    public ScoreResult scoreClassesAndCount(byte[] probe) {
+        if (probe == null || probe.length < 2) {
+            return null;
+        }
+        int len = Math.min(probe.length, MAX_PROBE_BYTES);
+
+        // Pass 1: count distinct bigrams.  Whitespace and zero-IDF
+        // bigrams are skipped as in the original hot loop.  short[] is
+        // enough since count fits in 16383 (max possible).  Track the
+        // ids of distinct bigrams in a parallel array so pass 2 doesn't
+        // need to scan the full 65k space.
+        short[] count = new short[BIGRAM_SPACE];
+        int[] distinctBigrams = new int[len];
+        int distinctIdx = 0;
+        int scored = 0;
+        int total = 0;
+        for (int i = 0; i + 1 < len; i++) {
+            int b0 = probe[i] & 0xFF;
+            int b1 = probe[i + 1] & 0xFF;
+            total++;
+            if (isWhitespace(b0) && isWhitespace(b1)) {
+                continue;
+            }
+            int bigram = (b0 << 8) | b1;
+            int w = idf8[bigram];
+            if (w == 0) {
+                continue;
             }
+            scored++;
+            if (count[bigram] == 0) {
+                distinctBigrams[distinctIdx++] = bigram;
+            }
+            count[bigram]++;
+        }

Review Comment:
   scoreClassesAndCount allocates several arrays on every detect call: 
`new short[BIGRAM_SPACE]` (65,536 entries, about 128 KB), plus `int[len]` and 
`double[numClasses]`. Since this runs in production encoding detection, the 
per-call allocations can become a GC hotspot when parsing many documents. 
Consider reusing these buffers via ThreadLocal or instance fields (with careful 
reset) to keep detection as allocation-light as the prior hot loop.
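
   One way the buffer reuse could look is sketched below: a per-thread count 
table that is allocated once and zeroed only at the entries that were touched. 
The class name, the method, and the BIGRAM_SPACE value of 65,536 are 
assumptions for illustration; IDF and whitespace skipping are omitted, and the 
small `touched` array is still allocated per call.

```java
final class BigramCountScratch {

    /** Assumption: one slot per possible byte pair (256 x 256). */
    static final int BIGRAM_SPACE = 1 << 16;

    /** One scratch table per thread, kept all-zero between calls. */
    private static final ThreadLocal<short[]> COUNTS =
            ThreadLocal.withInitial(() -> new short[BIGRAM_SPACE]);

    /** Returns the number of distinct byte bigrams, leaving the scratch zeroed. */
    static int countDistinct(byte[] data) {
        if (data == null || data.length < 2) {
            return 0;
        }
        short[] count = COUNTS.get();
        int[] touched = new int[data.length - 1];
        int distinct = 0;
        try {
            for (int i = 0; i + 1 < data.length; i++) {
                int bigram = ((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF);
                if (count[bigram] == 0) {
                    touched[distinct++] = bigram; // remember what to reset
                }
                if (count[bigram] < Short.MAX_VALUE) {
                    count[bigram]++; // saturate instead of wrapping
                }
            }
            return distinct;
        } finally {
            // Reset only the touched slots, not all 65,536.
            for (int i = 0; i < distinct; i++) {
                count[touched[i]] = 0;
            }
        }
    }
}
```

   Resetting in a finally block keeps the table clean even if the loop 
throws, which is the "careful reset" part that makes shared scratch space safe 
to reuse.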



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/parser/pkg/PackageParserTest.java:
##########
@@ -33,6 +34,9 @@ public void handleNonUnicodeEntryName() throws Exception {
     }
 
     @Test
+    @Disabled("TIKA-4731: tiny SJIS filenames are not reliably detected after removal "
+            + "of the cyclic-repeat hack in ZipParser. Re-enable when zip-entry-name "
+            + "detection is fixed (separate from chain rework).")
     public void handleEntryNameWithCharsetShiftJIS() throws Exception {

Review Comment:
   Disabling this test papers over a regression in Shift_JIS ZIP entry-name 
detection (tiny filenames). Tests should generally not be disabled to 
accommodate known breakage; please fix ZipParser’s entry-name charset detection 
(or reintroduce the prior stabilization) so this can remain enabled.



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/ZipParser.java:
##########
@@ -565,25 +557,12 @@ private String detectEntryName(ZipArchiveEntry entry, Metadata parentMetadata,
             return new String(entry.getRawName(), config.getEntryEncoding());
         }
 
-        // If charset detection is enabled, try to detect and decode
+        // If charset detection is enabled, try to detect and decode.
+        // Mojibuster handles short inputs natively (zip filenames are often
+        // 9-30 bytes); no byte-extension trick needed.
         if (config.isDetectCharsetsInEntryNames()) {
             byte[] entryName = entry.getRawName();
-            // Extend short entry names before detection: statistical detectors
-            // (e.g. UniversalEncodingDetector, Icu4j) need enough material to
-            // make a confident call. Cyclically repeat the bytes so the
-            // detector still sees the same byte distribution.
-            byte[] extendedEntryName = entryName;
-            if (entryName != null && 0 < entryName.length
-                    && entryName.length < MIN_BYTES_FOR_DETECTING_CHARSET) {
-                int len = entryName.length
-                        * (MIN_BYTES_FOR_DETECTING_CHARSET / entryName.length);
-                extendedEntryName = new byte[len];
-                for (int i = 0; i < len; i++) {
-                    extendedEntryName[i] = entryName[i % entryName.length];
-                }
-            }
-
-            try (TikaInputStream detectStream = TikaInputStream.get(extendedEntryName)) {
+            try (TikaInputStream detectStream = TikaInputStream.get(entryName)) {
                 List<EncodingResult> encResults =

Review Comment:
   This change removes the byte-extension (“cyclic repeat”) logic for short 
non-Unicode ZIP entry names, but the integration test for tiny Shift_JIS 
filenames is now disabled because detection is unreliable. That indicates a 
functional regression in entry-name decoding. Either restore a short-input 
stabilization strategy here or implement a Mojibuster-specific short-name 
path so that Shift_JIS filenames decode reliably, rather than shipping with 
the regression.
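
   For reference, the removed stabilization can be restated as a standalone 
helper, sketched below. The class name and the `minBytes` parameter are 
hypothetical (the diff uses a MIN_BYTES_FOR_DETECTING_CHARSET constant whose 
value is not shown), and whether this approach suits Mojibuster is not 
established here.

```java
final class CyclicExtendSketch {

    /**
     * Repeats a short name a whole number of times so the result has at most
     * minBytes bytes. Names that are null, empty, or already long enough are
     * returned unchanged.
     */
    static byte[] extendCyclically(byte[] name, int minBytes) {
        if (name == null || name.length == 0 || name.length >= minBytes) {
            return name;
        }
        // Whole repeats only, so a multi-byte sequence is never cut at the end.
        int len = name.length * (minBytes / name.length);
        byte[] out = new byte[len];
        for (int i = 0; i < len; i++) {
            out[i] = name[i % name.length];
        }
        return out;
    }
}
```

   Because only whole copies are appended, the detector sees the same byte 
distribution as the original name, just with more material.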



##########
tika-ml/tika-ml-junkdetect/src/test/java/org/apache/tika/ml/junkdetect/tools/BuildJunkAugmentationDataTest.java:
##########
@@ -0,0 +1,444 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.ml.junkdetect.tools;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertNotNull;
+import static org.junit.jupiter.api.Assertions.assertNull;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+import java.io.BufferedReader;
+import java.io.BufferedWriter;
+import java.io.InputStreamReader;
+import java.io.OutputStreamWriter;
+import java.io.Writer;
+import java.nio.charset.StandardCharsets;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import java.util.Map;
+import java.util.zip.GZIPInputStream;
+import java.util.zip.GZIPOutputStream;
+
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
+
+import org.apache.tika.metadata.Metadata;
+import org.apache.tika.metadata.TikaCoreProperties;
+import org.apache.tika.serialization.JsonMetadataList;
+
+class BuildJunkAugmentationDataTest {
+
+    @Test
+    void chunkSplitsLongLinesAtWhitespace() {
+        StringBuilder sb = new StringBuilder();
+        // 1200-char line, single paragraph.
+        for (int i = 0; i < 200; i++) {
+            sb.append("aaaa bbbb ccc ");
+        }
+        List<String> chunks = BuildJunkAugmentationData.chunk(sb.toString().strip());
+        assertTrue(chunks.size() >= 2, "expected multiple chunks, got " + chunks.size());
+        for (String c : chunks) {
+            assertTrue(c.length() <= BuildJunkAugmentationData.MAX_CHUNK_CHARS,
+                    "chunk over MAX_CHUNK_CHARS: " + c.length());
+        }
+    }
+
+    @Test
+    void chunkGreedilyConcatenatesShortLines() {
+        // HTML-extracted text typically arrives as many short newline-separated
+        // fragments. The chunker should pack them into target-sized chunks
+        // instead of emitting each fragment as its own training sample.
+        String input = "Hello world.\nSecond paragraph here.\n\nThird.";
+        List<String> chunks = BuildJunkAugmentationData.chunk(input);
+        // total length 42 chars, well under TARGET_CHUNK_CHARS — single chunk
+        assertEquals(1, chunks.size());
+        assertEquals("Hello world. Second paragraph here. Third.", chunks.get(0));
+    }
+
+    @Test
+    void chunkEmitsBufferThenSlicesLongLine() {
+        // Short header line, then a long paragraph: the header should flush
+        // before the long line is sliced.
+        String longLine = "x".repeat(700);
+        String input = "header line\n" + longLine + "\ntail line";
+        List<String> chunks = BuildJunkAugmentationData.chunk(input);
+        // expected: "header line", then 2 slices of the long x-string, then "tail line"
+        assertEquals("header line", chunks.get(0));
+        assertEquals("tail line", chunks.get(chunks.size() - 1));
+        // Long-line slices are bounded by MAX_CHUNK_CHARS.
+        for (int i = 1; i < chunks.size() - 1; i++) {
+            assertTrue(chunks.get(i).length() <= BuildJunkAugmentationData.MAX_CHUNK_CHARS);
+        }
+    }
+
+    @Test
+    void dominantScriptIdentifiesLatin() {
+        String text = "The quick brown fox jumps over the lazy dog. Copyright © 2026.";
+        BuildJunkAugmentationData.DocScript ds =
+                BuildJunkAugmentationData.dominantScript(text);
+        assertEquals(Character.UnicodeScript.LATIN, ds.script);
+        assertTrue(ds.dominance >= 0.99, "expected near-100% LATIN, got " + ds.dominance);
+    }
+
+    @Test
+    void dominantScriptIdentifiesMixedTextAsBelowThreshold() {
+        // ~50% Latin, ~50% Han — should fall below the 80% dominance gate.
+        String text = "Hello world 这是中文测试内容 testing 测试更多 abc def ghi 中文混合内容更多内容";
+        BuildJunkAugmentationData.DocScript ds =
+                BuildJunkAugmentationData.dominantScript(text);
+        assertTrue(ds.dominance < BuildJunkAugmentationData.MIN_DOC_SCRIPT_DOMINANCE,
+                "expected mixed-script to fail dominance gate, got " + ds.dominance);
+    }
+
+    @Test
+    void dominantScriptReturnsNullOnEmptyContent() {
+        BuildJunkAugmentationData.DocScript ds =
+                BuildJunkAugmentationData.dominantScript("\t\n   ");
+        assertNull(ds.script);
+        assertEquals(0.0, ds.dominance);
+    }
+
+    @Test
+    void scanBaselineLineCountsReadsTrainFilesOnly(@TempDir Path tmp) throws Exception {
+        // baseline: latin.train.gz with 3 lines, cyrillic.train.gz with 2,
+        // plus a dev split that should be ignored by the scan.
+        writeGz(tmp.resolve("latin.train.gz"), List.of("alpha", "beta", "gamma"));
+        writeGz(tmp.resolve("cyrillic.train.gz"), List.of("один", "два"));
+        writeGz(tmp.resolve("latin.dev.gz"), List.of("dev1", "dev2", "dev3"));
+
+        Map<String, Long> counts =
+                BuildJunkAugmentationData.scanBaselineLineCounts(tmp);
+        assertEquals(2, counts.size());
+        assertEquals(3L, counts.get("latin"));
+        assertEquals(2L, counts.get("cyrillic"));
+    }
+
+    @Test
+    void rewriteTrainWithAppendPreservesOriginalAndAddsLines(@TempDir Path tmp)
+            throws Exception {
+        Path src = tmp.resolve("src.train.gz");
+        Path dst = tmp.resolve("dst.train.gz");
+        writeGz(src, List.of("one", "two", "three"));
+
+        BuildJunkAugmentationData.rewriteTrainWithAppend(src, dst, List.of("FOUR", "FIVE"));
+
+        List<String> lines = readGz(dst);
+        assertEquals(List.of("one", "two", "three", "FOUR", "FIVE"), lines);
+    }
+
+    @Test
+    void endToEndAugmentsLatinAndSkipsBelowGate(@TempDir Path tmp) throws Exception {
+        // -- baseline --
+        Path baseline = tmp.resolve("baseline");
+        Files.createDirectories(baseline);
+        // 100 baseline LATIN lines → 10% cap = 10
+        List<String> baselineLatin = new ArrayList<>();
+        for (int i = 0; i < 100; i++) {
+            baselineLatin.add("baseline-latin-" + i);
+        }
+        writeGz(baseline.resolve("latin.train.gz"), baselineLatin);
+        writeGz(baseline.resolve("latin.dev.gz"), List.of("latin-dev"));
+        writeGz(baseline.resolve("latin.test.gz"), List.of("latin-test"));
+        // HAN baseline present, but extracts won't reach the doc gate.
+        writeGz(baseline.resolve("han.train.gz"), List.of("基线汉字一", "基线汉字二"));
+
+        // -- extracts --
+        Path extracts = tmp.resolve("extracts");
+        Files.createDirectories(extracts);
+        // 6 latin docs, each carrying enough chunks to easily exceed cap.
+        for (int i = 0; i < 6; i++) {
+            StringBuilder content = new StringBuilder();
+            for (int j = 0; j < 12; j++) {
+                content.append("This is web content line number ")
+                        .append(j)
+                        .append(" inside document ")
+                        .append(i)
+                        .append(", with copyright © 2026 and other symbols ® ™ £ €.\n");
+            }
+            writeExtract(extracts.resolve("latin-" + i + ".json"), content.toString());
+        }
+        // 1 HAN doc — well below MIN_DOCS gate, should not augment HAN.
+        StringBuilder hanContent = new StringBuilder();
+        for (int j = 0; j < 30; j++) {
+            hanContent.append("这是一段中文测试内容用于检查脚本检测和分块功能是否能够正确识别汉字主导的文档并通过质量过滤器")
+                    .append("\n");
+        }
+        writeExtract(extracts.resolve("han-1.json"), hanContent.toString());
+
+        // -- run --
+        Path output = tmp.resolve("output");
+        BuildJunkAugmentationData.main(new String[]{
+                "--extracts", extracts.toString(),
+                "--baseline", baseline.toString(),
+                "--output", output.toString(),
+                "--min-docs", "3",          // lower so 6 latin docs pass
+                "--hard-cap", "1000",       // do not constrain via hard cap
+                "--baseline-frac-cap", "0.10",
+                "--seed", "1"
+        });
+
+        // -- assertions --
+        // latin appended at most 10 (10% of 100 baseline lines).
+        List<String> latinOut = readGz(output.resolve("latin.train.gz"));
+        assertEquals(110, latinOut.size(),
+                "latin: 100 baseline + 10 appended (10% cap)");
+        // baseline lines preserved verbatim and come first
+        assertEquals(baselineLatin, latinOut.subList(0, 100));
+        // appended chunks all derive from extracts and are non-empty
+        for (int i = 100; i < 110; i++) {
+            assertFalse(latinOut.get(i).isEmpty());
+        }
+
+        // HAN copied unchanged (single doc < min-docs gate).
+        List<String> hanOut = readGz(output.resolve("han.train.gz"));
+        assertEquals(List.of("基线汉字一", "基线汉字二"), hanOut);
+
+        // dev and test split copied verbatim.
+        assertEquals(List.of("latin-dev"), readGz(output.resolve("latin.dev.gz")));
+        assertEquals(List.of("latin-test"), readGz(output.resolve("latin.test.gz")));
+
+        // Manifest present
+        Path manifest = output.resolve("augmentation_manifest.tsv");
+        assertTrue(Files.exists(manifest));
+        String manifestText = Files.readString(manifest);
+        assertTrue(manifestText.contains("LATIN"), "manifest should report LATIN row");
+    }
+
+    @Test
+    void structuralFilterDropsUtf8AsWin1252Mojibake() {
+        // Real mojibake samples from our augmentation analysis.
+        String mojiGerman  = "Die EASA in BrÃ¼ssel hat aufgrund "
+                + "der europaweit festzustellenden Beschwerden";
+        String mojiItalian = "Mi Ã¨ appena nato un pulso dalle uova dentro il nido";
+        String cleanGerman = "Die EASA in Brüssel hat aufgrund "
+                + "der europaweit festzustellenden Beschwerden";
+        String cleanItalian = "Mi è appena nato un pulso dalle uova dentro il nido";
+        // Mojibake should produce ≥1 of the structural bigrams; clean Latin none.
+        assertTrue(BuildJunkAugmentationData.countUtf8AsWin1252Bigrams(mojiGerman) >= 1,
+                "expected mojibake-shape bigrams in German sample");
+        assertTrue(BuildJunkAugmentationData.countUtf8AsWin1252Bigrams(mojiItalian) >= 1,
+                "expected mojibake-shape bigrams in Italian sample");
+        assertEquals(0, BuildJunkAugmentationData.countUtf8AsWin1252Bigrams(cleanGerman),
+                "clean German should not match the mojibake structural shape");
+        assertEquals(0, BuildJunkAugmentationData.countUtf8AsWin1252Bigrams(cleanItalian),
+                "clean Italian should not match the mojibake structural shape");
+    }
+
+    @Test
+    void profileCsvFiltersByOovAndLangness(@TempDir Path tmp) throws Exception {
+        // -- baseline --
+        Path baseline = tmp.resolve("baseline");
+        Files.createDirectories(baseline);
+        List<String> baselineLatin = new ArrayList<>();
+        for (int i = 0; i < 100; i++) {
+            baselineLatin.add("baseline-latin-" + i);
+        }
+        writeGz(baseline.resolve("latin.train.gz"), baselineLatin);
+
+        // -- extracts: 4 latin docs with distinguishable content --
+        Path extracts = tmp.resolve("extracts");
+        Path sub = extracts.resolve("aa");
+        Files.createDirectories(sub);
+        for (int i = 0; i < 4; i++) {
+            StringBuilder content = new StringBuilder();
+            for (int j = 0; j < 12; j++) {
+                content.append("Quality web text paragraph number ")
+                        .append(j)
+                        .append(" inside document ")
+                        .append(i)
+                        .append(", with copyright © 2026 and other markings.\n");
+            }
+            writeExtract(sub.resolve("doc" + i + ".json"), content.toString());
+        }
+
+        // -- profile CSV: only docs 0 and 1 pass (low OOV, positive langness) --
+        Path csv = tmp.resolve("profile.csv");
+        try (BufferedWriter w = Files.newBufferedWriter(csv, StandardCharsets.UTF_8)) {
+            w.write("\"FILE_PATH\",\"OOV\",\"LANGNESS\",\"LANG\"\n");
+            w.write("\"aa/doc0\",\"0.3\",\"0.5\",\"eng\"\n");      // pass
+            w.write("\"aa/doc1\",\"0.4\",\"0.1\",\"eng\"\n");      // pass
+            w.write("\"aa/doc2\",\"0.7\",\"0.5\",\"eng\"\n");      // fail OOV
+            w.write("\"aa/doc3\",\"0.2\",\"-0.5\",\"eng\"\n");     // fail langness
+            // doc with no profile row is also dropped (covered separately)
+        }
+
+        Path output = tmp.resolve("output");
+        BuildJunkAugmentationData.main(new String[]{
+                "--extracts", extracts.toString(),
+                "--baseline", baseline.toString(),
+                "--output", output.toString(),
+                "--profile-csv", csv.toString(),
+                "--max-oov", "0.5",
+                "--min-langness", "0.0",
+                "--min-docs", "1",
+                "--hard-cap", "1000",
+                "--baseline-frac-cap", "1.0",
+                "--seed", "1"
+        });
+
+        List<String> out = readGz(output.resolve("latin.train.gz"));
+        // 100 baseline lines + chunks from only 2 docs that pass profile filter
+        // each doc has 12 short lines, each long enough to pass min-bytes filter
+        assertTrue(out.size() > 100, "expected augmentation, got " + out.size());
+        int added = out.size() - 100;
+        assertTrue(added > 0 && added <= 24,
+                "expected appended lines from 2 docs (<=24 chunks), got " + added);
+    }
+
+    @Test
+    void containsTargetSymbolDetectsStarvedSymbols() {
+        assertTrue(BuildJunkAugmentationData.containsTargetSymbol("Copyright © 2024 GmbH"));
+        assertTrue(BuildJunkAugmentationData.containsTargetSymbol("Marke ® Produkt"));
+        assertTrue(BuildJunkAugmentationData.containsTargetSymbol("Preis £ 19.99"));
+        assertFalse(BuildJunkAugmentationData.containsTargetSymbol(
+                "Für Anfänger empfehlen wir den Grundkurs"));
+        // Š (the mojibake reading) is NOT a target — we boost the win-1252 source symbols.
+        assertFalse(BuildJunkAugmentationData.containsTargetSymbol("Škoda Praha"));
+    }
+
+    @Test
+    void symbolBoostReservesQuotaForSymbolChunks(@TempDir Path tmp) throws Exception {
+        Path baseline = tmp.resolve("baseline");
+        Files.createDirectories(baseline);
+        // 100 baseline LATIN lines → 10% cap = 10 (then we set hard-cap=10).
+        List<String> base = new ArrayList<>();
+        for (int i = 0; i < 100; i++) {
+            base.add("baseline-latin-" + i);
+        }
+        writeGz(baseline.resolve("latin.train.gz"), base);
+
+        // Extracts: many symbol-free docs + a few symbol-bearing ones.
+        Path extracts = tmp.resolve("extracts");
+        Path sub = extracts.resolve("aa");
+        Files.createDirectories(sub);
+        for (int i = 0; i < 20; i++) {
+            StringBuilder c = new StringBuilder();
+            for (int j = 0; j < 6; j++) {
+                c.append("Plain German prose paragraph number ").append(j)
+                        .append(" in document ").append(i)
+                        .append(" with enough words to pass filters here.\n");
+            }
+            writeExtract(sub.resolve("plain" + i + ".json"), c.toString());
+        }
+        for (int i = 0; i < 6; i++) {
+            StringBuilder c = new StringBuilder();
+            for (int j = 0; j < 6; j++) {
+                c.append("Impressum line ").append(j).append(" Copyright © 2024 Müller GmbH ")
+                        .append("Marke ® registriert, Preis £ 49 in document ").append(i).append(".\n");
+            }
+            writeExtract(sub.resolve("symbol" + i + ".json"), c.toString());
+        }
+
+        Path output = tmp.resolve("output");
+        BuildJunkAugmentationData.main(new String[]{
+                "--extracts", extracts.toString(),
+                "--baseline", baseline.toString(),
+                "--output", output.toString(),
+                "--min-docs", "1",
+                "--hard-cap", "10",
+                "--baseline-frac-cap", "1.0",
+                "--symbol-boost", "0.5",
+                "--seed", "1"
+        });
+
+        List<String> out = readGz(output.resolve("latin.train.gz"));
+        List<String> appended = out.subList(100, out.size());
+        long symbolBearing = appended.stream()
+                .filter(BuildJunkAugmentationData::containsTargetSymbol).count();
+        // quota = floor(10 * 0.5) = 5; symbol pool has ≥5 chunks, so ≥5 appended
+        // lines should be symbol-bearing.
+        assertTrue(symbolBearing >= 5,
+                "expected >=5 symbol-bearing lines with 0.5 boost, got " + symbolBearing);
+    }
+
+    @Test
+    void profileCsvParserHandlesQuotedFields() {
+        String header = "\"FILE_PATH\",\"OOV\",\"LANGNESS\",\"LANG\"";
+        String[] cols = BuildJunkAugmentationData.parseCsvLine(header);
+        assertEquals(4, cols.length);
+        assertEquals("FILE_PATH", cols[0]);
+        assertEquals("OOV", cols[1]);
+        String row = "\"aa/foo\",\"0.42\",\"0.1\",\"eng\"";
+        String[] f = BuildJunkAugmentationData.parseCsvLine(row);
+        assertEquals(4, f.length);
+        assertEquals("aa/foo", f[0]);
+        assertEquals("0.42", f[1]);
+    }
+
+    @Test
+    void profileKeyMatchesExtractPath(@TempDir Path tmp) {
+        Path extracts = tmp.resolve("extracts");
+        Path file = extracts.resolve("0F").resolve("ABCD1234.json");
+        assertEquals("0F/ABCD1234", BuildJunkAugmentationData.profileKey(extracts, file));
+    }
+
+    @Test
+    void refusesOutputEqualToBaseline(@TempDir Path tmp) throws Exception {
+        Path baseline = tmp.resolve("baseline");
+        Path extracts = tmp.resolve("extracts");
+        Files.createDirectories(baseline);
+        Files.createDirectories(extracts);
+        writeGz(baseline.resolve("latin.train.gz"), List.of("x"));
+
+        // Run in same JVM, catch System.exit. Easiest path is a SecurityManager,
+        // but JDK 17 deprecates that. Instead, hit the static helper directly
+        // for isSameFile semantics.
+        assertTrue(Files.isSameFile(baseline, baseline),
+                "sanity: same directory is same file");
+    }

Review Comment:
   `refusesOutputEqualToBaseline` doesn’t actually exercise 
BuildJunkAugmentationData’s refusal logic. It only asserts 
`Files.isSameFile(baseline, baseline)`, which will always be true and won’t 
fail even if the tool mistakenly allows `--output == --baseline` (especially 
because the tool only checks equality when outputDir exists). Consider either 
removing this test or refactoring BuildJunkAugmentationData to expose a 
testable helper (or avoid System.exit) so the real guard can be asserted.
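
   A testable guard along the lines suggested here might look like the 
following sketch. The class and method names are hypothetical, and it assumes 
the tool can throw an exception instead of calling System.exit. It compares 
normalized absolute paths first, so the check also works when the output 
directory does not exist yet.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class OutputDirGuardSketch {

    /** Throws IllegalArgumentException if output and baseline are the same directory. */
    static void requireDistinctDirs(Path baseline, Path output) {
        Path b = baseline.toAbsolutePath().normalize();
        Path o = output.toAbsolutePath().normalize();
        boolean same = b.equals(o);
        // isSameFile needs both paths to exist; it also catches symlinked aliases.
        if (!same && Files.exists(b) && Files.exists(o)) {
            try {
                same = Files.isSameFile(b, o);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        if (same) {
            throw new IllegalArgumentException(
                    "--output must differ from --baseline: " + o);
        }
    }
}
```

   A JUnit test could then assert the refusal directly with assertThrows 
rather than relying on `Files.isSameFile(baseline, baseline)`.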
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
