[jira] [Commented] (TIKA-4745) Small improvements to lang detection, charset detection and junk detection

ASF GitHub Bot (Jira) Thu, 28 May 2026 21:12:41 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18084252#comment-18084252
 ]


ASF GitHub Bot commented on TIKA-4745:
--------------------------------------

Copilot commented on code in PR #2848:
URL: https://github.com/apache/tika/pull/2848#discussion_r3322082410


##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java:
##########
@@ -454,21 +508,29 @@ public ScoreResult scoreClassesAndCount(byte[] probe) {
             }
 
             // logPs are negative; "best" class for the bigram = highest
-            // (least negative) contribution after dequant.
+            // (least negative) contribution after dequant.  Cap reference
+            // is the best contribution from a class outside top-1's
+            // cohort, so the cap engages on cross-cohort gaps that a
+            // max-vs-overall-runner-up cap missed when multiple classes
+            // in top-1's cohort sat close together.
+            int topClass = -1;
             double max = Double.NEGATIVE_INFINITY;
-            double secondMax = Double.NEGATIVE_INFINITY;
             for (int c = 0; c < numClasses; c++) {
                 double contrib = logP8[base + c] * countTimesIdf * 
perClassDequant[c];
                 contributions[c] = contrib;
                 if (contrib > max) {
-                    secondMax = max;
                     max = contrib;
-                } else if (contrib > secondMax) {
-                    secondMax = contrib;
+                    topClass = c;
+                }
+            }
+            Cohort topCohort = cohorts[topClass];
+            double bestCrossCohort = Double.NEGATIVE_INFINITY;
+            for (int c = 0; c < numClasses; c++) {
+                if (cohorts[c] != topCohort && contributions[c] > 
bestCrossCohort) {
+                    bestCrossCohort = contributions[c];

Review Comment:
   `bestCrossCohort` can remain `Double.NEGATIVE_INFINITY` when all model 
labels fall into the same cohort (or `numClasses==1`). In that case `capValue` 
becomes `-INF` and the code will clip all contributions to `-INF`, effectively 
destroying scoring for that probe/model. Consider falling back to the overall 
runner-up (old behavior) when no cross-cohort class exists, and skipping the 
cap entirely when there is no runner-up (single-class model).



##########
.skills/tika-eval-encoding-regression.md:
##########
@@ -0,0 +1,171 @@
+# tika-eval for encoding-detector regression hunts
+
+A condensed pattern for finding SBCS→CJK style charset-detector regressions
+(or any "A picks encoding X, B picks encoding Y" question) without
+building two tika-app distributions.
+
+## Two configs, one build
+
+Encoding-detector experiments don't need a "before" and "after" tika-app —
+the chain composition is per-config. Run the SAME tika-app twice against
+two configs, treat the outputs as `-a` and `-b`. Much faster than
+`tika-eval-compare`'s two-build flow.
+
+```bash
+# build once
+./mvnw clean install -pl tika-app -am -Pfast -DskipTests \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+unzip -q tika-app/target/tika-app-*.zip -d /tmp/tika-app-current
+
+# two configs (any combination of detectors)
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+  --config=tika-config-3x-default.json \
+  -i <corpus> -o ~/data/extracts/A -n 6
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+  --config=tika-config-junkfilter-combiner.json \
+  -i <corpus> -o ~/data/extracts/B -n 6
+
+# normal Compare
+java -jar /tmp/tika-eval-current/tika-eval-app-*.jar Compare \
+  -a ~/data/extracts/A -b ~/data/extracts/B -d ~/data/extracts/A-vs-B -r -rd 
~/data/extracts/A-vs-B-reports
+```
+
+### Canonical 3.x-default encoding chain config
+
+```json
+{
+  "encoding-detectors": [
+    {"html-encoding-detector": {}},
+    {"universal-encoding-detector": {}},
+    {"icu4j-encoding-detector": {}}
+  ]
+}
+```
+
+Existing copy: `~/data/claude-work/tika-config-3x-default.json`.

Review Comment:
   This doc line references a personal local path (`~/data/claude-work/...`) 
that likely won’t exist for other contributors. Prefer a placeholder work 
directory (or drop the “Existing copy” line entirely) so the instructions are 
portable.



##########
.skills/tika-eval-encoding-regression.md:
##########
@@ -0,0 +1,171 @@
+# tika-eval for encoding-detector regression hunts
+
+A condensed pattern for finding SBCS→CJK style charset-detector regressions
+(or any "A picks encoding X, B picks encoding Y" question) without
+building two tika-app distributions.
+
+## Two configs, one build
+
+Encoding-detector experiments don't need a "before" and "after" tika-app —
+the chain composition is per-config. Run the SAME tika-app twice against
+two configs, treat the outputs as `-a` and `-b`. Much faster than
+`tika-eval-compare`'s two-build flow.
+
+```bash
+# build once
+./mvnw clean install -pl tika-app -am -Pfast -DskipTests \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+unzip -q tika-app/target/tika-app-*.zip -d /tmp/tika-app-current
+
+# two configs (any combination of detectors)
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+  --config=tika-config-3x-default.json \
+  -i <corpus> -o ~/data/extracts/A -n 6
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+  --config=tika-config-junkfilter-combiner.json \
+  -i <corpus> -o ~/data/extracts/B -n 6
+
+# normal Compare
+java -jar /tmp/tika-eval-current/tika-eval-app-*.jar Compare \
+  -a ~/data/extracts/A -b ~/data/extracts/B -d ~/data/extracts/A-vs-B -r -rd 
~/data/extracts/A-vs-B-reports

Review Comment:
   The example runs `Compare` from `/tmp/tika-eval-current/...`, but the 
snippet never shows building/unzipping tika-eval into that location. As 
written, the command sequence won’t work in a fresh environment; add a short 
build/unzip step (or point to `.skills/tika-eval-compare.md`).



##########
.skills/tika-eval-encoding-regression.md:
##########
@@ -0,0 +1,171 @@
+# tika-eval for encoding-detector regression hunts
+
+A condensed pattern for finding SBCS→CJK style charset-detector regressions
+(or any "A picks encoding X, B picks encoding Y" question) without
+building two tika-app distributions.
+
+## Two configs, one build
+
+Encoding-detector experiments don't need a "before" and "after" tika-app —
+the chain composition is per-config. Run the SAME tika-app twice against
+two configs, treat the outputs as `-a` and `-b`. Much faster than
+`tika-eval-compare`'s two-build flow.
+
+```bash
+# build once
+./mvnw clean install -pl tika-app -am -Pfast -DskipTests \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+unzip -q tika-app/target/tika-app-*.zip -d /tmp/tika-app-current
+
+# two configs (any combination of detectors)
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+  --config=tika-config-3x-default.json \
+  -i <corpus> -o ~/data/extracts/A -n 6
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+  --config=tika-config-junkfilter-combiner.json \
+  -i <corpus> -o ~/data/extracts/B -n 6
+
+# normal Compare
+java -jar /tmp/tika-eval-current/tika-eval-app-*.jar Compare \
+  -a ~/data/extracts/A -b ~/data/extracts/B -d ~/data/extracts/A-vs-B -r -rd 
~/data/extracts/A-vs-B-reports
+```
+
+### Canonical 3.x-default encoding chain config
+
+```json
+{
+  "encoding-detectors": [
+    {"html-encoding-detector": {}},
+    {"universal-encoding-detector": {}},
+    {"icu4j-encoding-detector": {}}
+  ]
+}
+```
+
+Existing copy: `~/data/claude-work/tika-config-3x-default.json`.
+
+### Canonical 4.x junkfilter chain config
+
+```json
+{
+  "encoding-detectors": [
+    {"bom-detector": {}},
+    {"html-encoding-detector": {}},
+    {"mojibuster-encoding-detector": {}},
+    {"junk-filter-encoding-detector": {}}
+  ]
+}
+```
+
+Existing copy: 
`~/data/smoke/eval-runtime/tika-config-junkfilter-combiner.json`.
+
+### Per-detector isolation configs
+
+Each detector wired alone lives in `~/data/commoncrawl/cc-html-eval/configs/`:
+`tika-config-bom.json`, `tika-config-html.json`, 
`tika-config-htmlstandard.json`,
+`tika-config-universal.json`, `tika-config-icu4j.json`,

Review Comment:
   This section hard-codes a specific corpus/config directory 
(`~/data/commoncrawl/...`). Using a placeholder path (and/or describing what 
should be in the directory) would make the instructions reusable for others.



##########
.skills/tika-eval-encoding-regression.md:
##########
@@ -0,0 +1,171 @@
+# tika-eval for encoding-detector regression hunts
+
+A condensed pattern for finding SBCS→CJK style charset-detector regressions
+(or any "A picks encoding X, B picks encoding Y" question) without
+building two tika-app distributions.
+
+## Two configs, one build
+
+Encoding-detector experiments don't need a "before" and "after" tika-app —
+the chain composition is per-config. Run the SAME tika-app twice against
+two configs, treat the outputs as `-a` and `-b`. Much faster than
+`tika-eval-compare`'s two-build flow.
+
+```bash
+# build once
+./mvnw clean install -pl tika-app -am -Pfast -DskipTests \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+unzip -q tika-app/target/tika-app-*.zip -d /tmp/tika-app-current
+
+# two configs (any combination of detectors)
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+  --config=tika-config-3x-default.json \
+  -i <corpus> -o ~/data/extracts/A -n 6
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+  --config=tika-config-junkfilter-combiner.json \
+  -i <corpus> -o ~/data/extracts/B -n 6
+
+# normal Compare
+java -jar /tmp/tika-eval-current/tika-eval-app-*.jar Compare \
+  -a ~/data/extracts/A -b ~/data/extracts/B -d ~/data/extracts/A-vs-B -r -rd 
~/data/extracts/A-vs-B-reports
+```
+
+### Canonical 3.x-default encoding chain config
+
+```json
+{
+  "encoding-detectors": [
+    {"html-encoding-detector": {}},
+    {"universal-encoding-detector": {}},
+    {"icu4j-encoding-detector": {}}
+  ]
+}
+```
+
+Existing copy: `~/data/claude-work/tika-config-3x-default.json`.
+
+### Canonical 4.x junkfilter chain config
+
+```json
+{
+  "encoding-detectors": [
+    {"bom-detector": {}},
+    {"html-encoding-detector": {}},
+    {"mojibuster-encoding-detector": {}},
+    {"junk-filter-encoding-detector": {}}
+  ]
+}
+```
+
+Existing copy: 
`~/data/smoke/eval-runtime/tika-config-junkfilter-combiner.json`.

Review Comment:
   This doc line references a machine-specific local path (`~/data/smoke/...`). 
Consider using a placeholder (e.g., `<workdir>/...`) so these instructions 
don’t depend on a particular environment.





> Small improvements to lang detection, charset detection and junk detection
> --------------------------------------------------------------------------
>
>                 Key: TIKA-4745
>                 URL: https://issues.apache.org/jira/browse/TIKA-4745
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>
> I ran a regression test in prep for the 4.0.0-beta-1 release. There are a 
> number of smallish things that we can clean up in the components listed in 
> the title.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (TIKA-4745) Small improvements to lang detection, charset detection and junk detection

Reply via email to