[
https://issues.apache.org/jira/browse/TIKA-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18083699#comment-18083699
]
ASF GitHub Bot commented on TIKA-4731:
--------------------------------------
Copilot commented on code in PR #2839:
URL: https://github.com/apache/tika/pull/2839#discussion_r3308398037
##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/tools/BuildJunkTrainingData.java:
##########
@@ -658,6 +658,11 @@ static String filterSentence(String text, int minBytes,
double maxPuncFrac,
if (text.indexOf('\uFFFD') >= 0) {
return null;
}
+ // NFD (not NFC) so combining-mark scripts (Vietnamese precomposed,
+ // Indic, Thai) have their marks as separate codepoints in the
+ // training corpus. Lets per-script bigram tables and z5 (letter-
+ // adjacent-to-mark) discriminate uniformly across mark-using
+ // scripts. Must match JunkDetector.scoreText's normalization.
text = Normalizer.normalize(text, Normalizer.Form.NFC);
Review Comment:
The normalization comment says “NFD (not NFC)” but the code normalizes with
Normalizer.Form.NFC, and JunkDetector.aggregate() also NFC-normalizes. Please
fix the comment (or the normalization) so training/inference behavior and the
rationale match; as written it’s internally inconsistent and misleading for
future changes.
##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/AdaptiveProbe.java:
##########
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.ml.chardetect;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.commons.io.IOUtils;
+
+import org.apache.tika.io.TikaInputStream;
+
+/**
+ * Reads an encoding-detection probe sized by <em>content</em>, not raw bytes.
+ *
+ * <p>A fixed raw probe (e.g. 16 KB) starves the detectors when a page
+ * leads with a large {@code <head>}/inline script: after tag-stripping there's
+ * little body text, and the bytes that distinguish charsets sit past the
+ * window. This grows the read until ~{@code contentTarget} bytes of
+ * tag-stripped content are present, capped at {@code rawCap} raw bytes.
+ *
+ * <p>Text-rich pages stop early (~one chunk); markup-heavy pages read deeper,
+ * bounded by the cap. Multi-byte encodings (UTF-16/32) register no ASCII tags
+ * so they stop at {@code contentTarget} raw bytes.
+ */
+public final class AdaptiveProbe {
+
+ /** Default body-content target. */
+ public static final int DEFAULT_CONTENT_TARGET = 16384;
+ /** Default hard ceiling on raw bytes read. */
+ public static final int DEFAULT_RAW_CAP = 524288;
+
+ private AdaptiveProbe() {
+ }
+
+ /**
+ * Reads from {@code tis} (mark/reset preserved) until tag-stripped content
+ * reaches {@code contentTarget}, the raw read reaches {@code rawCap}, or
+ * EOF — whichever first. Returns the raw bytes read.
+ */
+ public static byte[] read(TikaInputStream tis, int contentTarget, int
rawCap)
+ throws IOException {
+ tis.mark(rawCap);
+ try {
+ byte[] buf = new byte[rawCap];
+ byte[] stripDst = new byte[rawCap];
+ int total = 0;
+ while (total < rawCap) {
+ int want = Math.min(rawCap - total, contentTarget);
+ int n = IOUtils.read(tis, buf, total, want);
+ total += n;
+ HtmlByteStripper.Result r =
+ HtmlByteStripper.stripTags(buf, 0, total, stripDst, 0);
+ int content = r.tagCount > 0 ? r.length : total;
+ if (content >= contentTarget || n < want) {
+ break; // enough body text, or EOF
+ }
+ }
Review Comment:
AdaptiveProbe.read adds the return value of IOUtils.read() to `total`
without handling EOF. IOUtils.read returns -1 at EOF, which makes `total`
negative and can trigger invalid array copies / stripTags calls (this should be
hit by the empty-input test). Handle `n < 0` (or `n <= 0`) as EOF and break
without modifying `total`.
##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java:
##########
@@ -211,62 +268,262 @@ public List<EncodingResult> detect(TikaInputStream tis,
Metadata metadata,
return detect(readProbe(tis));
}
+ /** ASCII whitespace: TAB, LF, VT, FF, CR, SPACE. */
+ private static boolean isWhitespace(int b) {
+ return b == 0x09 || b == 0x0a || b == 0x0b || b == 0x0c
+ || b == 0x0d || b == 0x20;
+ }
+
public List<EncodingResult> detect(byte[] probe) {
- if (probe == null || probe.length < 2) {
+ ScoreResult sr = scoreClassesAndCount(probe);
+ if (sr == null) {
return Collections.emptyList();
}
- int len = Math.min(probe.length, MAX_PROBE_BYTES);
+ return emitCandidates(sr.scores, sr.scoredBigrams);
+ }
+
+ /**
+ * Score result returned by {@link #scoreClassesAndCount(byte[])}.
+ * Exposes the raw per-class score vector together with the number
+ * of bigrams that actually contributed to the dot product (i.e.,
+ * bigrams with non-zero IDF and not skipped by the whitespace-pair
+ * rule) and the total bigrams in the scored region of the probe.
+ * {@code scoredBigrams} is the unit of "evidence available to NB"
+ * — robust to HTML / whitespace noise in the input because those
+ * bigrams have IDF == 0 and don't contribute.
+ */
+ public static final class ScoreResult {
+ public final double[] scores;
+ public final int scoredBigrams;
+ public final int totalBigrams;
+ public ScoreResult(double[] scores, int scoredBigrams, int
totalBigrams) {
+ this.scores = scores;
+ this.scoredBigrams = scoredBigrams;
+ this.totalBigrams = totalBigrams;
+ }
+ }
+
+ /**
+ * Compute the raw per-class score vector for a probe, without
+ * top-K extraction or softmax. Returns {@code null} for null /
+ * tiny probes that can't be scored.
+ */
+ public double[] scoreClasses(byte[] probe) {
+ ScoreResult sr = scoreClassesAndCount(probe);
+ return sr == null ? null : sr.scores;
+ }
+
+ /**
+ * Per-bigram contribution to the per-class score, used for
+ * diagnostic tools that want to understand why a probe scores
+ * one class over another. Returned by
+ * {@link #analyzeBigrams(byte[], int, int)}.
+ */
+ public static final class BigramContrib {
+ public final int bigram; // (b0 << 8) | b1
+ public final double contribA; // logP_A * idf in nats
+ public final double contribB;
+ public BigramContrib(int bigram, double a, double b) {
+ this.bigram = bigram;
+ this.contribA = a;
+ this.contribB = b;
+ }
+ public double diff() {
+ return contribA - contribB;
+ }
+ }
- // Integer hot loop — CharSoup-style. int8 logP × int8 IDF →
- // int16 product, accumulated into int32 per class. Overflow
- // safety: at MAX_PROBE_BYTES=1024, max 1023 bigrams × 127 × 127
- // ≈ 16.5M per class, well inside int32's 2.1B headroom.
- int[] dots = new int[numClasses];
+ /**
+ * For each scored bigram in the probe (same skip rules as
+ * {@link #scoreClasses(byte[])}), compute and return its
+ * dequantized contribution to two specified classes' scores.
+ * The list is in probe order, with duplicates allowed (a bigram
+ * that appears N times in the probe yields N entries).
+ */
+ public List<BigramContrib> analyzeBigrams(byte[] probe, int classA, int
classB) {
+ List<BigramContrib> out = new java.util.ArrayList<>();
+ if (probe == null || probe.length < 2) {
+ return out;
+ }
+ int len = Math.min(probe.length, MAX_PROBE_BYTES);
+ // perClassDequant[c] folds scale[c] × idfScale already, so
+ // contribution(bigram, c) = logP8[..c] * idf8[bigram] *
perClassDequant[c]
+ double dqA = perClassDequant[classA];
+ double dqB = perClassDequant[classB];
for (int i = 0; i + 1 < len; i++) {
- int bigram = ((probe[i] & 0xFF) << 8) | (probe[i + 1] & 0xFF);
- int w = idf8[bigram]; // non-negative, 0..127
+ int b0 = probe[i] & 0xFF;
+ int b1 = probe[i + 1] & 0xFF;
+ if (isWhitespace(b0) && isWhitespace(b1)) {
+ continue;
+ }
+ int bigram = (b0 << 8) | b1;
+ int w = idf8[bigram];
if (w == 0) {
- continue; // bigram appears in every class; no signal
+ continue;
}
int base = bigram * numClasses;
- for (int c = 0; c < numClasses; c++) {
- dots[c] += logP8[base + c] * w;
+ double contribA = logP8[base + classA] * w * dqA;
+ double contribB = logP8[base + classB] * w * dqB;
+ out.add(new BigramContrib(bigram, contribA, contribB));
+ }
+ return out;
+ }
+
+ /**
+ * Like {@link #scoreClasses(byte[])} but also reports the number
+ * of bigrams that contributed to the dot product vs the total
+ * scored region. Used by offline calibration to bucket samples
+ * by "evidence available" rather than raw byte length.
+ */
+ public ScoreResult scoreClassesAndCount(byte[] probe) {
+ if (probe == null || probe.length < 2) {
+ return null;
+ }
+ int len = Math.min(probe.length, MAX_PROBE_BYTES);
+
+ // Pass 1: count distinct bigrams. Whitespace and zero-IDF
+ // bigrams are skipped as in the original hot loop. short[] is
+ // enough since count fits in 16383 (max possible). Track the
+ // ids of distinct bigrams in a parallel array so pass 2 doesn't
+ // need to scan the full 65k space.
+ short[] count = new short[BIGRAM_SPACE];
+ int[] distinctBigrams = new int[len];
+ int distinctIdx = 0;
+ int scored = 0;
+ int total = 0;
+ for (int i = 0; i + 1 < len; i++) {
+ int b0 = probe[i] & 0xFF;
+ int b1 = probe[i + 1] & 0xFF;
+ total++;
+ if (isWhitespace(b0) && isWhitespace(b1)) {
+ continue;
+ }
+ int bigram = (b0 << 8) | b1;
+ int w = idf8[bigram];
+ if (w == 0) {
+ continue;
}
+ scored++;
+ if (count[bigram] == 0) {
+ distinctBigrams[distinctIdx++] = bigram;
+ }
+ count[bigram]++;
+ }
Review Comment:
scoreClassesAndCount allocates several large arrays on every detect call
(notably `new short[BIGRAM_SPACE]` = 65,536 entries) plus `int[len]` and
`double[numClasses]`. Since this runs in production encoding detection, the
per-call allocations can become a GC hotspot when parsing many documents.
Consider reusing these buffers via ThreadLocal or instance fields (with careful
reset) to keep detection allocation-free like the prior hot loop.
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/parser/pkg/PackageParserTest.java:
##########
@@ -33,6 +34,9 @@ public void handleNonUnicodeEntryName() throws Exception {
}
@Test
+ @Disabled("TIKA-4731: tiny SJIS filenames are not reliably detected after
removal "
+ + "of the cyclic-repeat hack in ZipParser. Re-enable when
zip-entry-name "
+ + "detection is fixed (separate from chain rework).")
public void handleEntryNameWithCharsetShiftJIS() throws Exception {
Review Comment:
Disabling this test papers over a regression in Shift_JIS ZIP entry-name
detection (tiny filenames). Tests should generally not be disabled to
accommodate known breakage; please fix ZipParser’s entry-name charset detection
(or reintroduce the prior stabilization) so this can remain enabled.
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/ZipParser.java:
##########
@@ -565,25 +557,12 @@ private String detectEntryName(ZipArchiveEntry entry,
Metadata parentMetadata,
return new String(entry.getRawName(), config.getEntryEncoding());
}
- // If charset detection is enabled, try to detect and decode
+ // If charset detection is enabled, try to detect and decode.
+ // Mojibuster handles short inputs natively (zip filenames are often
+ // 9-30 bytes); no byte-extension trick needed.
if (config.isDetectCharsetsInEntryNames()) {
byte[] entryName = entry.getRawName();
- // Extend short entry names before detection: statistical detectors
- // (e.g. UniversalEncodingDetector, Icu4j) need enough material to
- // make a confident call. Cyclically repeat the bytes so the
- // detector still sees the same byte distribution.
- byte[] extendedEntryName = entryName;
- if (entryName != null && 0 < entryName.length
- && entryName.length < MIN_BYTES_FOR_DETECTING_CHARSET) {
- int len = entryName.length
- * (MIN_BYTES_FOR_DETECTING_CHARSET / entryName.length);
- extendedEntryName = new byte[len];
- for (int i = 0; i < len; i++) {
- extendedEntryName[i] = entryName[i % entryName.length];
- }
- }
-
- try (TikaInputStream detectStream =
TikaInputStream.get(extendedEntryName)) {
+ try (TikaInputStream detectStream =
TikaInputStream.get(entryName)) {
List<EncodingResult> encResults =
Review Comment:
This change removes the byte-extension (“cyclic repeat”) logic for short
non-Unicode ZIP entry names, but the integration test for tiny Shift_JIS
filenames is now disabled because detection is unreliable. That indicates a
functional regression in entry-name decoding. Either restore a short-input
stabilization strategy here (or implement a Mojibuster-specific short-name
path) so Shift_JIS filenames decode reliably, rather than shipping with the
regression.
##########
tika-ml/tika-ml-junkdetect/src/test/java/org/apache/tika/ml/junkdetect/tools/BuildJunkAugmentationDataTest.java:
##########
@@ -0,0 +1,444 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.ml.junkdetect.tools;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertNotNull;
+import static org.junit.jupiter.api.Assertions.assertNull;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+import java.io.BufferedReader;
+import java.io.BufferedWriter;
+import java.io.InputStreamReader;
+import java.io.OutputStreamWriter;
+import java.io.Writer;
+import java.nio.charset.StandardCharsets;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import java.util.Map;
+import java.util.zip.GZIPInputStream;
+import java.util.zip.GZIPOutputStream;
+
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
+
+import org.apache.tika.metadata.Metadata;
+import org.apache.tika.metadata.TikaCoreProperties;
+import org.apache.tika.serialization.JsonMetadataList;
+
+class BuildJunkAugmentationDataTest {
+
+ @Test
+ void chunkSplitsLongLinesAtWhitespace() {
+ StringBuilder sb = new StringBuilder();
+ // 1200-char line, single paragraph.
+ for (int i = 0; i < 200; i++) {
+ sb.append("aaaa bbbb ccc ");
+ }
+ List<String> chunks =
BuildJunkAugmentationData.chunk(sb.toString().strip());
+ assertTrue(chunks.size() >= 2, "expected multiple chunks, got " +
chunks.size());
+ for (String c : chunks) {
+ assertTrue(c.length() <= BuildJunkAugmentationData.MAX_CHUNK_CHARS,
+ "chunk over MAX_CHUNK_CHARS: " + c.length());
+ }
+ }
+
+ @Test
+ void chunkGreedilyConcatenatesShortLines() {
+ // HTML-extracted text typically arrives as many short
newline-separated
+ // fragments. The chunker should pack them into target-sized chunks
+ // instead of emitting each fragment as its own training sample.
+ String input = "Hello world.\nSecond paragraph here.\n\nThird.";
+ List<String> chunks = BuildJunkAugmentationData.chunk(input);
+ // total length 42 chars, well under TARGET_CHUNK_CHARS — single chunk
+ assertEquals(1, chunks.size());
+ assertEquals("Hello world. Second paragraph here. Third.",
chunks.get(0));
+ }
+
+ @Test
+ void chunkEmitsBufferThenSlicesLongLine() {
+ // Short header line, then a long paragraph: the header should flush
+ // before the long line is sliced.
+ String longLine = "x".repeat(700);
+ String input = "header line\n" + longLine + "\ntail line";
+ List<String> chunks = BuildJunkAugmentationData.chunk(input);
+ // expected: "header line", then 2 slices of the long x-string, then
"tail line"
+ assertEquals("header line", chunks.get(0));
+ assertEquals("tail line", chunks.get(chunks.size() - 1));
+ // Long-line slices are bounded by MAX_CHUNK_CHARS.
+ for (int i = 1; i < chunks.size() - 1; i++) {
+ assertTrue(chunks.get(i).length() <=
BuildJunkAugmentationData.MAX_CHUNK_CHARS);
+ }
+ }
+
+ @Test
+ void dominantScriptIdentifiesLatin() {
+ String text = "The quick brown fox jumps over the lazy dog. Copyright
© 2026.";
+ BuildJunkAugmentationData.DocScript ds =
+ BuildJunkAugmentationData.dominantScript(text);
+ assertEquals(Character.UnicodeScript.LATIN, ds.script);
+ assertTrue(ds.dominance >= 0.99, "expected near-100% LATIN, got " +
ds.dominance);
+ }
+
+ @Test
+ void dominantScriptIdentifiesMixedTextAsBelowThreshold() {
+ // ~50% Latin, ~50% Han — should fall below the 80% dominance gate.
+ String text = "Hello world 这是中文测试内容 testing 测试更多 abc def ghi
中文混合内容更多内容";
+ BuildJunkAugmentationData.DocScript ds =
+ BuildJunkAugmentationData.dominantScript(text);
+ assertTrue(ds.dominance <
BuildJunkAugmentationData.MIN_DOC_SCRIPT_DOMINANCE,
+ "expected mixed-script to fail dominance gate, got " +
ds.dominance);
+ }
+
+ @Test
+ void dominantScriptReturnsNullOnEmptyContent() {
+ BuildJunkAugmentationData.DocScript ds =
+ BuildJunkAugmentationData.dominantScript("\t\n ");
+ assertNull(ds.script);
+ assertEquals(0.0, ds.dominance);
+ }
+
+ @Test
+ void scanBaselineLineCountsReadsTrainFilesOnly(@TempDir Path tmp) throws
Exception {
+ // baseline: latin.train.gz with 3 lines, cyrillic.train.gz with 2,
+ // plus a dev split that should be ignored by the scan.
+ writeGz(tmp.resolve("latin.train.gz"), List.of("alpha", "beta",
"gamma"));
+ writeGz(tmp.resolve("cyrillic.train.gz"), List.of("один", "два"));
+ writeGz(tmp.resolve("latin.dev.gz"), List.of("dev1", "dev2", "dev3"));
+
+ Map<String, Long> counts =
+ BuildJunkAugmentationData.scanBaselineLineCounts(tmp);
+ assertEquals(2, counts.size());
+ assertEquals(3L, counts.get("latin"));
+ assertEquals(2L, counts.get("cyrillic"));
+ }
+
+ @Test
+ void rewriteTrainWithAppendPreservesOriginalAndAddsLines(@TempDir Path tmp)
+ throws Exception {
+ Path src = tmp.resolve("src.train.gz");
+ Path dst = tmp.resolve("dst.train.gz");
+ writeGz(src, List.of("one", "two", "three"));
+
+ BuildJunkAugmentationData.rewriteTrainWithAppend(src, dst,
List.of("FOUR", "FIVE"));
+
+ List<String> lines = readGz(dst);
+ assertEquals(List.of("one", "two", "three", "FOUR", "FIVE"), lines);
+ }
+
+ @Test
+ void endToEndAugmentsLatinAndSkipsBelowGate(@TempDir Path tmp) throws
Exception {
+ //
> Ongoing improvements to the junk detector
> -----------------------------------------
>
> Key: TIKA-4731
> URL: https://issues.apache.org/jira/browse/TIKA-4731
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
>
> With [https://github.com/apache/tika/pull/2818,] I think we have a decent
> shape for the junk detector.
> There are several areas for improvement, but I think it is ready to go.
> This ticket tracks follow on work, including:
> * Smaller model
> * Handling pathological code block changes
> * Handling candidates with different character counts
> * Other items to be discovered in our commoncrawl/govdocs1 corpus?
> We have some coverage for the middle two item, but need to address those more
> directly.
> This work is not a blocker on the 4.0.0-beta-1 release.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)