Copilot commented on code in PR #2848:
URL: https://github.com/apache/tika/pull/2848#discussion_r3322082410


##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java:
##########
@@ -454,21 +508,29 @@ public ScoreResult scoreClassesAndCount(byte[] probe) {
             }
 
             // logPs are negative; "best" class for the bigram = highest
-            // (least negative) contribution after dequant.
+            // (least negative) contribution after dequant.  Cap reference
+            // is the best contribution from a class outside top-1's
+            // cohort, so the cap engages on cross-cohort gaps that a
+            // max-vs-overall-runner-up cap missed when multiple classes
+            // in top-1's cohort sat close together.
+            int topClass = -1;
             double max = Double.NEGATIVE_INFINITY;
-            double secondMax = Double.NEGATIVE_INFINITY;
             for (int c = 0; c < numClasses; c++) {
                 double contrib = logP8[base + c] * countTimesIdf * 
perClassDequant[c];
                 contributions[c] = contrib;
                 if (contrib > max) {
-                    secondMax = max;
                     max = contrib;
-                } else if (contrib > secondMax) {
-                    secondMax = contrib;
+                    topClass = c;
+                }
+            }
+            Cohort topCohort = cohorts[topClass];
+            double bestCrossCohort = Double.NEGATIVE_INFINITY;
+            for (int c = 0; c < numClasses; c++) {
+                if (cohorts[c] != topCohort && contributions[c] > 
bestCrossCohort) {
+                    bestCrossCohort = contributions[c];

Review Comment:
   `bestCrossCohort` can remain `Double.NEGATIVE_INFINITY` when all model 
labels fall into the same cohort (or `numClasses==1`). In that case `capValue` 
becomes `-INF` and the code will clip all contributions to `-INF`, effectively 
destroying scoring for that probe/model. Consider falling back to the overall 
runner-up (old behavior) when no cross-cohort class exists, and skipping the 
cap entirely when there is no runner-up (single-class model).



##########
.skills/tika-eval-encoding-regression.md:
##########
@@ -0,0 +1,171 @@
+# tika-eval for encoding-detector regression hunts
+
+A condensed pattern for finding SBCS→CJK style charset-detector regressions
+(or any "A picks encoding X, B picks encoding Y" question) without
+building two tika-app distributions.
+
+## Two configs, one build
+
+Encoding-detector experiments don't need a "before" and "after" tika-app —
+the chain composition is per-config. Run the SAME tika-app twice against
+two configs, treat the outputs as `-a` and `-b`. Much faster than
+`tika-eval-compare`'s two-build flow.
+
+```bash
+# build once
+./mvnw clean install -pl tika-app -am -Pfast -DskipTests \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+unzip -q tika-app/target/tika-app-*.zip -d /tmp/tika-app-current
+
+# two configs (any combination of detectors)
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+  --config=tika-config-3x-default.json \
+  -i <corpus> -o ~/data/extracts/A -n 6
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+  --config=tika-config-junkfilter-combiner.json \
+  -i <corpus> -o ~/data/extracts/B -n 6
+
+# normal Compare
+java -jar /tmp/tika-eval-current/tika-eval-app-*.jar Compare \
+  -a ~/data/extracts/A -b ~/data/extracts/B -d ~/data/extracts/A-vs-B -r -rd 
~/data/extracts/A-vs-B-reports
+```
+
+### Canonical 3.x-default encoding chain config
+
+```json
+{
+  "encoding-detectors": [
+    {"html-encoding-detector": {}},
+    {"universal-encoding-detector": {}},
+    {"icu4j-encoding-detector": {}}
+  ]
+}
+```
+
+Existing copy: `~/data/claude-work/tika-config-3x-default.json`.

Review Comment:
   This doc line references a personal local path (`~/data/claude-work/...`) 
that likely won’t exist for other contributors. Prefer a placeholder work 
directory (or drop the “Existing copy” line entirely) so the instructions are 
portable.



##########
.skills/tika-eval-encoding-regression.md:
##########
@@ -0,0 +1,171 @@
+# tika-eval for encoding-detector regression hunts
+
+A condensed pattern for finding SBCS→CJK style charset-detector regressions
+(or any "A picks encoding X, B picks encoding Y" question) without
+building two tika-app distributions.
+
+## Two configs, one build
+
+Encoding-detector experiments don't need a "before" and "after" tika-app —
+the chain composition is per-config. Run the SAME tika-app twice against
+two configs, treat the outputs as `-a` and `-b`. Much faster than
+`tika-eval-compare`'s two-build flow.
+
+```bash
+# build once
+./mvnw clean install -pl tika-app -am -Pfast -DskipTests \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+unzip -q tika-app/target/tika-app-*.zip -d /tmp/tika-app-current
+
+# two configs (any combination of detectors)
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+  --config=tika-config-3x-default.json \
+  -i <corpus> -o ~/data/extracts/A -n 6
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+  --config=tika-config-junkfilter-combiner.json \
+  -i <corpus> -o ~/data/extracts/B -n 6
+
+# normal Compare
+java -jar /tmp/tika-eval-current/tika-eval-app-*.jar Compare \
+  -a ~/data/extracts/A -b ~/data/extracts/B -d ~/data/extracts/A-vs-B -r -rd 
~/data/extracts/A-vs-B-reports

Review Comment:
   The example runs `Compare` from `/tmp/tika-eval-current/...`, but the 
snippet never shows building/unzipping tika-eval into that location. As 
written, the command sequence won’t work in a fresh environment; add a short 
build/unzip step (or point to `.skills/tika-eval-compare.md`).



##########
.skills/tika-eval-encoding-regression.md:
##########
@@ -0,0 +1,171 @@
+# tika-eval for encoding-detector regression hunts
+
+A condensed pattern for finding SBCS→CJK style charset-detector regressions
+(or any "A picks encoding X, B picks encoding Y" question) without
+building two tika-app distributions.
+
+## Two configs, one build
+
+Encoding-detector experiments don't need a "before" and "after" tika-app —
+the chain composition is per-config. Run the SAME tika-app twice against
+two configs, treat the outputs as `-a` and `-b`. Much faster than
+`tika-eval-compare`'s two-build flow.
+
+```bash
+# build once
+./mvnw clean install -pl tika-app -am -Pfast -DskipTests \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+unzip -q tika-app/target/tika-app-*.zip -d /tmp/tika-app-current
+
+# two configs (any combination of detectors)
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+  --config=tika-config-3x-default.json \
+  -i <corpus> -o ~/data/extracts/A -n 6
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+  --config=tika-config-junkfilter-combiner.json \
+  -i <corpus> -o ~/data/extracts/B -n 6
+
+# normal Compare
+java -jar /tmp/tika-eval-current/tika-eval-app-*.jar Compare \
+  -a ~/data/extracts/A -b ~/data/extracts/B -d ~/data/extracts/A-vs-B -r -rd 
~/data/extracts/A-vs-B-reports
+```
+
+### Canonical 3.x-default encoding chain config
+
+```json
+{
+  "encoding-detectors": [
+    {"html-encoding-detector": {}},
+    {"universal-encoding-detector": {}},
+    {"icu4j-encoding-detector": {}}
+  ]
+}
+```
+
+Existing copy: `~/data/claude-work/tika-config-3x-default.json`.
+
+### Canonical 4.x junkfilter chain config
+
+```json
+{
+  "encoding-detectors": [
+    {"bom-detector": {}},
+    {"html-encoding-detector": {}},
+    {"mojibuster-encoding-detector": {}},
+    {"junk-filter-encoding-detector": {}}
+  ]
+}
+```
+
+Existing copy: 
`~/data/smoke/eval-runtime/tika-config-junkfilter-combiner.json`.
+
+### Per-detector isolation configs
+
+Each detector wired alone lives in `~/data/commoncrawl/cc-html-eval/configs/`:
+`tika-config-bom.json`, `tika-config-html.json`, 
`tika-config-htmlstandard.json`,
+`tika-config-universal.json`, `tika-config-icu4j.json`,

Review Comment:
   This section hard-codes a specific corpus/config directory 
(`~/data/commoncrawl/...`). Using a placeholder path (and/or describing what 
should be in the directory) would make the instructions reusable for others.



##########
.skills/tika-eval-encoding-regression.md:
##########
@@ -0,0 +1,171 @@
+# tika-eval for encoding-detector regression hunts
+
+A condensed pattern for finding SBCS→CJK style charset-detector regressions
+(or any "A picks encoding X, B picks encoding Y" question) without
+building two tika-app distributions.
+
+## Two configs, one build
+
+Encoding-detector experiments don't need a "before" and "after" tika-app —
+the chain composition is per-config. Run the SAME tika-app twice against
+two configs, treat the outputs as `-a` and `-b`. Much faster than
+`tika-eval-compare`'s two-build flow.
+
+```bash
+# build once
+./mvnw clean install -pl tika-app -am -Pfast -DskipTests \
+  -Dmaven.repo.local=$(pwd)/.local_m2_repo
+unzip -q tika-app/target/tika-app-*.zip -d /tmp/tika-app-current
+
+# two configs (any combination of detectors)
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+  --config=tika-config-3x-default.json \
+  -i <corpus> -o ~/data/extracts/A -n 6
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+  --config=tika-config-junkfilter-combiner.json \
+  -i <corpus> -o ~/data/extracts/B -n 6
+
+# normal Compare
+java -jar /tmp/tika-eval-current/tika-eval-app-*.jar Compare \
+  -a ~/data/extracts/A -b ~/data/extracts/B -d ~/data/extracts/A-vs-B -r -rd 
~/data/extracts/A-vs-B-reports
+```
+
+### Canonical 3.x-default encoding chain config
+
+```json
+{
+  "encoding-detectors": [
+    {"html-encoding-detector": {}},
+    {"universal-encoding-detector": {}},
+    {"icu4j-encoding-detector": {}}
+  ]
+}
+```
+
+Existing copy: `~/data/claude-work/tika-config-3x-default.json`.
+
+### Canonical 4.x junkfilter chain config
+
+```json
+{
+  "encoding-detectors": [
+    {"bom-detector": {}},
+    {"html-encoding-detector": {}},
+    {"mojibuster-encoding-detector": {}},
+    {"junk-filter-encoding-detector": {}}
+  ]
+}
+```
+
+Existing copy: 
`~/data/smoke/eval-runtime/tika-config-junkfilter-combiner.json`.

Review Comment:
   This doc line references a machine-specific local path (`~/data/smoke/...`). 
Consider using a placeholder (e.g., `<workdir>/...`) so these instructions 
don’t depend on a particular environment.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to