Copilot commented on code in PR #2848:
URL: https://github.com/apache/tika/pull/2848#discussion_r3322082410
##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java:
##########
@@ -454,21 +508,29 @@ public ScoreResult scoreClassesAndCount(byte[] probe) {
}
// logPs are negative; "best" class for the bigram = highest
- // (least negative) contribution after dequant.
+ // (least negative) contribution after dequant. Cap reference
+ // is the best contribution from a class outside top-1's
+ // cohort, so the cap engages on cross-cohort gaps that a
+ // max-vs-overall-runner-up cap missed when multiple classes
+ // in top-1's cohort sat close together.
+ int topClass = -1;
double max = Double.NEGATIVE_INFINITY;
- double secondMax = Double.NEGATIVE_INFINITY;
for (int c = 0; c < numClasses; c++) {
double contrib = logP8[base + c] * countTimesIdf *
perClassDequant[c];
contributions[c] = contrib;
if (contrib > max) {
- secondMax = max;
max = contrib;
- } else if (contrib > secondMax) {
- secondMax = contrib;
+ topClass = c;
+ }
+ }
+ Cohort topCohort = cohorts[topClass];
+ double bestCrossCohort = Double.NEGATIVE_INFINITY;
+ for (int c = 0; c < numClasses; c++) {
+ if (cohorts[c] != topCohort && contributions[c] >
bestCrossCohort) {
+ bestCrossCohort = contributions[c];
Review Comment:
`bestCrossCohort` can remain `Double.NEGATIVE_INFINITY` when all model
labels fall into the same cohort (or `numClasses==1`). In that case `capValue`
becomes `-INF` and the code will clip all contributions to `-INF`, effectively
destroying scoring for that probe/model. Consider falling back to the overall
runner-up (old behavior) when no cross-cohort class exists, and skipping the
cap entirely when there is no runner-up (single-class model).
##########
.skills/tika-eval-encoding-regression.md:
##########
@@ -0,0 +1,171 @@
+# tika-eval for encoding-detector regression hunts
+
+A condensed pattern for finding SBCS→CJK style charset-detector regressions
+(or any "A picks encoding X, B picks encoding Y" question) without
+building two tika-app distributions.
+
+## Two configs, one build
+
+Encoding-detector experiments don't need a "before" and "after" tika-app —
+the chain composition is per-config. Run the SAME tika-app twice against
+two configs, treat the outputs as `-a` and `-b`. Much faster than
+`tika-eval-compare`'s two-build flow.
+
+```bash
+# build once
+./mvnw clean install -pl tika-app -am -Pfast -DskipTests \
+ -Dmaven.repo.local=$(pwd)/.local_m2_repo
+unzip -q tika-app/target/tika-app-*.zip -d /tmp/tika-app-current
+
+# two configs (any combination of detectors)
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+ --config=tika-config-3x-default.json \
+ -i <corpus> -o ~/data/extracts/A -n 6
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+ --config=tika-config-junkfilter-combiner.json \
+ -i <corpus> -o ~/data/extracts/B -n 6
+
+# normal Compare
+java -jar /tmp/tika-eval-current/tika-eval-app-*.jar Compare \
+ -a ~/data/extracts/A -b ~/data/extracts/B -d ~/data/extracts/A-vs-B -r -rd
~/data/extracts/A-vs-B-reports
+```
+
+### Canonical 3.x-default encoding chain config
+
+```json
+{
+ "encoding-detectors": [
+ {"html-encoding-detector": {}},
+ {"universal-encoding-detector": {}},
+ {"icu4j-encoding-detector": {}}
+ ]
+}
+```
+
+Existing copy: `~/data/claude-work/tika-config-3x-default.json`.
Review Comment:
This doc line references a personal local path (`~/data/claude-work/...`)
that likely won’t exist for other contributors. Prefer a placeholder work
directory (or drop the “Existing copy” line entirely) so the instructions are
portable.
##########
.skills/tika-eval-encoding-regression.md:
##########
@@ -0,0 +1,171 @@
+# tika-eval for encoding-detector regression hunts
+
+A condensed pattern for finding SBCS→CJK style charset-detector regressions
+(or any "A picks encoding X, B picks encoding Y" question) without
+building two tika-app distributions.
+
+## Two configs, one build
+
+Encoding-detector experiments don't need a "before" and "after" tika-app —
+the chain composition is per-config. Run the SAME tika-app twice against
+two configs, treat the outputs as `-a` and `-b`. Much faster than
+`tika-eval-compare`'s two-build flow.
+
+```bash
+# build once
+./mvnw clean install -pl tika-app -am -Pfast -DskipTests \
+ -Dmaven.repo.local=$(pwd)/.local_m2_repo
+unzip -q tika-app/target/tika-app-*.zip -d /tmp/tika-app-current
+
+# two configs (any combination of detectors)
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+ --config=tika-config-3x-default.json \
+ -i <corpus> -o ~/data/extracts/A -n 6
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+ --config=tika-config-junkfilter-combiner.json \
+ -i <corpus> -o ~/data/extracts/B -n 6
+
+# normal Compare
+java -jar /tmp/tika-eval-current/tika-eval-app-*.jar Compare \
+ -a ~/data/extracts/A -b ~/data/extracts/B -d ~/data/extracts/A-vs-B -r -rd
~/data/extracts/A-vs-B-reports
Review Comment:
The example runs `Compare` from `/tmp/tika-eval-current/...`, but the
snippet never shows building/unzipping tika-eval into that location. As
written, the command sequence won’t work in a fresh environment; add a short
build/unzip step (or point to `.skills/tika-eval-compare.md`).
##########
.skills/tika-eval-encoding-regression.md:
##########
@@ -0,0 +1,171 @@
+# tika-eval for encoding-detector regression hunts
+
+A condensed pattern for finding SBCS→CJK style charset-detector regressions
+(or any "A picks encoding X, B picks encoding Y" question) without
+building two tika-app distributions.
+
+## Two configs, one build
+
+Encoding-detector experiments don't need a "before" and "after" tika-app —
+the chain composition is per-config. Run the SAME tika-app twice against
+two configs, treat the outputs as `-a` and `-b`. Much faster than
+`tika-eval-compare`'s two-build flow.
+
+```bash
+# build once
+./mvnw clean install -pl tika-app -am -Pfast -DskipTests \
+ -Dmaven.repo.local=$(pwd)/.local_m2_repo
+unzip -q tika-app/target/tika-app-*.zip -d /tmp/tika-app-current
+
+# two configs (any combination of detectors)
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+ --config=tika-config-3x-default.json \
+ -i <corpus> -o ~/data/extracts/A -n 6
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+ --config=tika-config-junkfilter-combiner.json \
+ -i <corpus> -o ~/data/extracts/B -n 6
+
+# normal Compare
+java -jar /tmp/tika-eval-current/tika-eval-app-*.jar Compare \
+ -a ~/data/extracts/A -b ~/data/extracts/B -d ~/data/extracts/A-vs-B -r -rd
~/data/extracts/A-vs-B-reports
+```
+
+### Canonical 3.x-default encoding chain config
+
+```json
+{
+ "encoding-detectors": [
+ {"html-encoding-detector": {}},
+ {"universal-encoding-detector": {}},
+ {"icu4j-encoding-detector": {}}
+ ]
+}
+```
+
+Existing copy: `~/data/claude-work/tika-config-3x-default.json`.
+
+### Canonical 4.x junkfilter chain config
+
+```json
+{
+ "encoding-detectors": [
+ {"bom-detector": {}},
+ {"html-encoding-detector": {}},
+ {"mojibuster-encoding-detector": {}},
+ {"junk-filter-encoding-detector": {}}
+ ]
+}
+```
+
+Existing copy:
`~/data/smoke/eval-runtime/tika-config-junkfilter-combiner.json`.
+
+### Per-detector isolation configs
+
+Each detector wired alone lives in `~/data/commoncrawl/cc-html-eval/configs/`:
+`tika-config-bom.json`, `tika-config-html.json`,
`tika-config-htmlstandard.json`,
+`tika-config-universal.json`, `tika-config-icu4j.json`,
Review Comment:
This section hard-codes a specific corpus/config directory
(`~/data/commoncrawl/...`). Using a placeholder path (and/or describing what
should be in the directory) would make the instructions reusable for others.
##########
.skills/tika-eval-encoding-regression.md:
##########
@@ -0,0 +1,171 @@
+# tika-eval for encoding-detector regression hunts
+
+A condensed pattern for finding SBCS→CJK style charset-detector regressions
+(or any "A picks encoding X, B picks encoding Y" question) without
+building two tika-app distributions.
+
+## Two configs, one build
+
+Encoding-detector experiments don't need a "before" and "after" tika-app —
+the chain composition is per-config. Run the SAME tika-app twice against
+two configs, treat the outputs as `-a` and `-b`. Much faster than
+`tika-eval-compare`'s two-build flow.
+
+```bash
+# build once
+./mvnw clean install -pl tika-app -am -Pfast -DskipTests \
+ -Dmaven.repo.local=$(pwd)/.local_m2_repo
+unzip -q tika-app/target/tika-app-*.zip -d /tmp/tika-app-current
+
+# two configs (any combination of detectors)
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+ --config=tika-config-3x-default.json \
+ -i <corpus> -o ~/data/extracts/A -n 6
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+ --config=tika-config-junkfilter-combiner.json \
+ -i <corpus> -o ~/data/extracts/B -n 6
+
+# normal Compare
+java -jar /tmp/tika-eval-current/tika-eval-app-*.jar Compare \
+ -a ~/data/extracts/A -b ~/data/extracts/B -d ~/data/extracts/A-vs-B -r -rd
~/data/extracts/A-vs-B-reports
+```
+
+### Canonical 3.x-default encoding chain config
+
+```json
+{
+ "encoding-detectors": [
+ {"html-encoding-detector": {}},
+ {"universal-encoding-detector": {}},
+ {"icu4j-encoding-detector": {}}
+ ]
+}
+```
+
+Existing copy: `~/data/claude-work/tika-config-3x-default.json`.
+
+### Canonical 4.x junkfilter chain config
+
+```json
+{
+ "encoding-detectors": [
+ {"bom-detector": {}},
+ {"html-encoding-detector": {}},
+ {"mojibuster-encoding-detector": {}},
+ {"junk-filter-encoding-detector": {}}
+ ]
+}
+```
+
+Existing copy:
`~/data/smoke/eval-runtime/tika-config-junkfilter-combiner.json`.
Review Comment:
This doc line references a machine-specific local path (`~/data/smoke/...`).
Consider using a placeholder (e.g., `<workdir>/...`) so these instructions
don’t depend on a particular environment.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]