[
https://issues.apache.org/jira/browse/TIKA-4745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18084252#comment-18084252
]
ASF GitHub Bot commented on TIKA-4745:
--------------------------------------
Copilot commented on code in PR #2848:
URL: https://github.com/apache/tika/pull/2848#discussion_r3322082410
##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/NaiveBayesBigramEncodingDetector.java:
##########
@@ -454,21 +508,29 @@ public ScoreResult scoreClassesAndCount(byte[] probe) {
}
// logPs are negative; "best" class for the bigram = highest
- // (least negative) contribution after dequant.
+ // (least negative) contribution after dequant. Cap reference
+ // is the best contribution from a class outside top-1's
+ // cohort, so the cap engages on cross-cohort gaps that a
+ // max-vs-overall-runner-up cap missed when multiple classes
+ // in top-1's cohort sat close together.
+ int topClass = -1;
double max = Double.NEGATIVE_INFINITY;
- double secondMax = Double.NEGATIVE_INFINITY;
for (int c = 0; c < numClasses; c++) {
double contrib = logP8[base + c] * countTimesIdf *
perClassDequant[c];
contributions[c] = contrib;
if (contrib > max) {
- secondMax = max;
max = contrib;
- } else if (contrib > secondMax) {
- secondMax = contrib;
+ topClass = c;
+ }
+ }
+ Cohort topCohort = cohorts[topClass];
+ double bestCrossCohort = Double.NEGATIVE_INFINITY;
+ for (int c = 0; c < numClasses; c++) {
+ if (cohorts[c] != topCohort && contributions[c] >
bestCrossCohort) {
+ bestCrossCohort = contributions[c];
Review Comment:
`bestCrossCohort` can remain `Double.NEGATIVE_INFINITY` when all model
labels fall into the same cohort (or `numClasses==1`). In that case `capValue`
becomes `-INF` and the code will clip all contributions to `-INF`, effectively
destroying scoring for that probe/model. Consider falling back to the overall
runner-up (old behavior) when no cross-cohort class exists, and skipping the
cap entirely when there is no runner-up (single-class model).
##########
.skills/tika-eval-encoding-regression.md:
##########
@@ -0,0 +1,171 @@
+# tika-eval for encoding-detector regression hunts
+
+A condensed pattern for finding SBCS→CJK style charset-detector regressions
+(or any "A picks encoding X, B picks encoding Y" question) without
+building two tika-app distributions.
+
+## Two configs, one build
+
+Encoding-detector experiments don't need a "before" and "after" tika-app —
+the chain composition is per-config. Run the SAME tika-app twice against
+two configs, treat the outputs as `-a` and `-b`. Much faster than
+`tika-eval-compare`'s two-build flow.
+
+```bash
+# build once
+./mvnw clean install -pl tika-app -am -Pfast -DskipTests \
+ -Dmaven.repo.local=$(pwd)/.local_m2_repo
+unzip -q tika-app/target/tika-app-*.zip -d /tmp/tika-app-current
+
+# two configs (any combination of detectors)
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+ --config=tika-config-3x-default.json \
+ -i <corpus> -o ~/data/extracts/A -n 6
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+ --config=tika-config-junkfilter-combiner.json \
+ -i <corpus> -o ~/data/extracts/B -n 6
+
+# normal Compare
+java -jar /tmp/tika-eval-current/tika-eval-app-*.jar Compare \
+ -a ~/data/extracts/A -b ~/data/extracts/B -d ~/data/extracts/A-vs-B -r -rd
~/data/extracts/A-vs-B-reports
+```
+
+### Canonical 3.x-default encoding chain config
+
+```json
+{
+ "encoding-detectors": [
+ {"html-encoding-detector": {}},
+ {"universal-encoding-detector": {}},
+ {"icu4j-encoding-detector": {}}
+ ]
+}
+```
+
+Existing copy: `~/data/claude-work/tika-config-3x-default.json`.
Review Comment:
This doc line references a personal local path (`~/data/claude-work/...`)
that likely won’t exist for other contributors. Prefer a placeholder work
directory (or drop the “Existing copy” line entirely) so the instructions are
portable.
##########
.skills/tika-eval-encoding-regression.md:
##########
@@ -0,0 +1,171 @@
+# tika-eval for encoding-detector regression hunts
+
+A condensed pattern for finding SBCS→CJK style charset-detector regressions
+(or any "A picks encoding X, B picks encoding Y" question) without
+building two tika-app distributions.
+
+## Two configs, one build
+
+Encoding-detector experiments don't need a "before" and "after" tika-app —
+the chain composition is per-config. Run the SAME tika-app twice against
+two configs, treat the outputs as `-a` and `-b`. Much faster than
+`tika-eval-compare`'s two-build flow.
+
+```bash
+# build once
+./mvnw clean install -pl tika-app -am -Pfast -DskipTests \
+ -Dmaven.repo.local=$(pwd)/.local_m2_repo
+unzip -q tika-app/target/tika-app-*.zip -d /tmp/tika-app-current
+
+# two configs (any combination of detectors)
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+ --config=tika-config-3x-default.json \
+ -i <corpus> -o ~/data/extracts/A -n 6
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+ --config=tika-config-junkfilter-combiner.json \
+ -i <corpus> -o ~/data/extracts/B -n 6
+
+# normal Compare
+java -jar /tmp/tika-eval-current/tika-eval-app-*.jar Compare \
+ -a ~/data/extracts/A -b ~/data/extracts/B -d ~/data/extracts/A-vs-B -r -rd
~/data/extracts/A-vs-B-reports
Review Comment:
The example runs `Compare` from `/tmp/tika-eval-current/...`, but the
snippet never shows building/unzipping tika-eval into that location. As
written, the command sequence won’t work in a fresh environment; add a short
build/unzip step (or point to `.skills/tika-eval-compare.md`).
##########
.skills/tika-eval-encoding-regression.md:
##########
@@ -0,0 +1,171 @@
+# tika-eval for encoding-detector regression hunts
+
+A condensed pattern for finding SBCS→CJK style charset-detector regressions
+(or any "A picks encoding X, B picks encoding Y" question) without
+building two tika-app distributions.
+
+## Two configs, one build
+
+Encoding-detector experiments don't need a "before" and "after" tika-app —
+the chain composition is per-config. Run the SAME tika-app twice against
+two configs, treat the outputs as `-a` and `-b`. Much faster than
+`tika-eval-compare`'s two-build flow.
+
+```bash
+# build once
+./mvnw clean install -pl tika-app -am -Pfast -DskipTests \
+ -Dmaven.repo.local=$(pwd)/.local_m2_repo
+unzip -q tika-app/target/tika-app-*.zip -d /tmp/tika-app-current
+
+# two configs (any combination of detectors)
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+ --config=tika-config-3x-default.json \
+ -i <corpus> -o ~/data/extracts/A -n 6
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+ --config=tika-config-junkfilter-combiner.json \
+ -i <corpus> -o ~/data/extracts/B -n 6
+
+# normal Compare
+java -jar /tmp/tika-eval-current/tika-eval-app-*.jar Compare \
+ -a ~/data/extracts/A -b ~/data/extracts/B -d ~/data/extracts/A-vs-B -r -rd
~/data/extracts/A-vs-B-reports
+```
+
+### Canonical 3.x-default encoding chain config
+
+```json
+{
+ "encoding-detectors": [
+ {"html-encoding-detector": {}},
+ {"universal-encoding-detector": {}},
+ {"icu4j-encoding-detector": {}}
+ ]
+}
+```
+
+Existing copy: `~/data/claude-work/tika-config-3x-default.json`.
+
+### Canonical 4.x junkfilter chain config
+
+```json
+{
+ "encoding-detectors": [
+ {"bom-detector": {}},
+ {"html-encoding-detector": {}},
+ {"mojibuster-encoding-detector": {}},
+ {"junk-filter-encoding-detector": {}}
+ ]
+}
+```
+
+Existing copy:
`~/data/smoke/eval-runtime/tika-config-junkfilter-combiner.json`.
+
+### Per-detector isolation configs
+
+Each detector wired alone lives in `~/data/commoncrawl/cc-html-eval/configs/`:
+`tika-config-bom.json`, `tika-config-html.json`,
`tika-config-htmlstandard.json`,
+`tika-config-universal.json`, `tika-config-icu4j.json`,
Review Comment:
This section hard-codes a specific corpus/config directory
(`~/data/commoncrawl/...`). Using a placeholder path (and/or describing what
should be in the directory) would make the instructions reusable for others.
##########
.skills/tika-eval-encoding-regression.md:
##########
@@ -0,0 +1,171 @@
+# tika-eval for encoding-detector regression hunts
+
+A condensed pattern for finding SBCS→CJK style charset-detector regressions
+(or any "A picks encoding X, B picks encoding Y" question) without
+building two tika-app distributions.
+
+## Two configs, one build
+
+Encoding-detector experiments don't need a "before" and "after" tika-app —
+the chain composition is per-config. Run the SAME tika-app twice against
+two configs, treat the outputs as `-a` and `-b`. Much faster than
+`tika-eval-compare`'s two-build flow.
+
+```bash
+# build once
+./mvnw clean install -pl tika-app -am -Pfast -DskipTests \
+ -Dmaven.repo.local=$(pwd)/.local_m2_repo
+unzip -q tika-app/target/tika-app-*.zip -d /tmp/tika-app-current
+
+# two configs (any combination of detectors)
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+ --config=tika-config-3x-default.json \
+ -i <corpus> -o ~/data/extracts/A -n 6
+java -jar /tmp/tika-app-current/tika-app-*.jar \
+ --config=tika-config-junkfilter-combiner.json \
+ -i <corpus> -o ~/data/extracts/B -n 6
+
+# normal Compare
+java -jar /tmp/tika-eval-current/tika-eval-app-*.jar Compare \
+ -a ~/data/extracts/A -b ~/data/extracts/B -d ~/data/extracts/A-vs-B -r -rd
~/data/extracts/A-vs-B-reports
+```
+
+### Canonical 3.x-default encoding chain config
+
+```json
+{
+ "encoding-detectors": [
+ {"html-encoding-detector": {}},
+ {"universal-encoding-detector": {}},
+ {"icu4j-encoding-detector": {}}
+ ]
+}
+```
+
+Existing copy: `~/data/claude-work/tika-config-3x-default.json`.
+
+### Canonical 4.x junkfilter chain config
+
+```json
+{
+ "encoding-detectors": [
+ {"bom-detector": {}},
+ {"html-encoding-detector": {}},
+ {"mojibuster-encoding-detector": {}},
+ {"junk-filter-encoding-detector": {}}
+ ]
+}
+```
+
+Existing copy:
`~/data/smoke/eval-runtime/tika-config-junkfilter-combiner.json`.
Review Comment:
This doc line references a machine-specific local path (`~/data/smoke/...`).
Consider using a placeholder (e.g., `<workdir>/...`) so these instructions
don’t depend on a particular environment.
> Small improvements to lang detection, charset detection and junk detection
> --------------------------------------------------------------------------
>
> Key: TIKA-4745
> URL: https://issues.apache.org/jira/browse/TIKA-4745
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
>
> I ran a regression test in prep for the 4.0.0-beta-1 release. There are a
> number of smallish things that we can clean up in the components listed in
> the title.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)