[ https://issues.apache.org/jira/browse/TIKA-4745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18086413#comment-18086413 ]
ASF GitHub Bot commented on TIKA-4745:
--------------------------------------
tballison commented on code in PR #2872:
URL: https://github.com/apache/tika/pull/2872#discussion_r3363772214
##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/JunkDetector.java:
##########
@@ -973,15 +980,70 @@ public static int packBigramKey(int idxA, int idxB) {
* once when scanning the text (avoiding a redundant binary search per
* codepoint).
*/
+ /** Small per-bigram log-prob penalty subtracted from the case-folded
+ * (lowercase) value when scoring an uppercase pair. All-caps is a genuinely
+ * weaker/rarer signal than lowercase, so it should score a hair BELOW its
+ * lowercase form, not equal to it — and the margin guards the edge case where
+ * an all-caps *mojibake* decode whose lowercase twin happens to be a seen
+ * bigram would otherwise score like real lowercase text. Kept small (0.25):
+ * the lowercase/junk margin is ~0.8 logit, and δ=0.5 thinned it to ~0.1, so
+ * 0.25 leaves all-caps clearly clean (~0.5 above junk) while honoring the
+ * "somewhat less languagey" principle. */
+ private static final double CASE_FOLD_PENALTY = 0.25;
+
private static double scorePairF1(int cpA, int idxA, int cpB, int idxB,
BigramTables tables) {
+ double direct = Double.NaN;
if (idxA >= 0 && idxB >= 0) {
int slot = lookupBigramSlot(tables, idxA, idxB);
if (slot >= 0) {
- return dequantize(tables.bigramValues[slot],
+ direct = dequantize(tables.bigramValues[slot],
tables.bigramQuantMin, tables.bigramQuantMax);
}
}
+ // Case-folded backoff: an ALL-UPPERCASE pair that is the case variant of
+ // a SEEN lowercase pair is real text wearing a different case (all-caps
+ // headings / emphasis, e.g. Greek "ΚΑΤΑΛΟΓΟΣ", Russian "МУЗЕЙ"), NOT junk.
+ // Score it as the BETTER of its own log-prob and its lowercase twin's —
+ // i.e. max(direct, fold). max (not fold-only-on-miss) is essential: real
+ // all-caps bigrams ARE present in training (from headings) but rare, so the
+ // direct lookup hits a low value (МУ −12.4 vs lowercase му −6.7) and would
+ // otherwise bypass the fold and floor. This is the discriminator raw
+ // probability cannot be: all-caps real text and all-caps mojibake are both
+ // improbable, but only real text has a SEEN lowercase twin. Gated on BOTH
+ // codepoints being uppercase (case-CONSISTENT) so alternating-case junk
+ // ("tHiS") stays unfolded and floors; and only the lowercase twin's value
+ // is borrowed when that pair is actually seen, so all-caps mojibake
+ // (lowercase form also unseen) floors.
+ // Gate = "at least one uppercase letter AND no LOWERCASE letter" — so it
+ // folds both an interior all-caps pair (МУ) AND an edge pair where the other
+ // side is a sentinel or glue (^М, Й$, "М."), but NOT a mixed-case pair (the
+ // lowercase letter in "aB"/"tHiS" trips the gate, so case-inconsistent junk
+ // still floors). Each uppercase letter is folded; sentinels/digits/glue
+ // pass through unchanged. Folding the edges too is what fully rescues short
+ // all-caps headings, whose ^X/X$ bigrams would otherwise floor on the rare
+ // uppercase-letter unigram backoff.
+ if ((Character.isUpperCase(cpA) || Character.isUpperCase(cpB))
+ && !Character.isLowerCase(cpA) && !Character.isLowerCase(cpB)) {
+ int lcA = Character.isUpperCase(cpA) ? Character.toLowerCase(cpA) : cpA;
+ int lcB = Character.isUpperCase(cpB) ? Character.toLowerCase(cpB) : cpB;
Review Comment:
I couldn't replicate this, and other agents disagree.
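
To help reproduce the behavior under discussion, the gate and the `max(direct, fold)` rule described in the hunk's comments can be exercised in isolation. The following is only a minimal sketch, not the code in `JunkDetector`: the class name `CaseFoldSketch`, the `LOG_PROBS` map (a toy stand-in for `BigramTables`), the `FLOOR` value, and the way `CASE_FOLD_PENALTY` is subtracted from the folded value are all assumptions, since the hunk is cut off before the fold is applied. The two table entries reuse the МУ/му figures quoted in the comment.

```java
import java.util.HashMap;
import java.util.Map;

class CaseFoldSketch {
    // Value taken from the hunk.
    static final double CASE_FOLD_PENALTY = 0.25;
    // Hypothetical score for an unseen pair; the real floor is not shown in the hunk.
    static final double FLOOR = -20.0;

    // Toy stand-in for the quantized bigram table (assumption).
    static final Map<String, Double> LOG_PROBS = new HashMap<>();
    static {
        LOG_PROBS.put("\u043C\u0443", -6.7);   // lowercase Cyrillic pair (U+043C U+0443)
        LOG_PROBS.put("\u041C\u0423", -12.4);  // uppercase Cyrillic pair (U+041C U+0423)
    }

    // Gate from the hunk: at least one uppercase letter and no lowercase letter.
    static boolean foldGate(int cpA, int cpB) {
        return (Character.isUpperCase(cpA) || Character.isUpperCase(cpB))
                && !Character.isLowerCase(cpA) && !Character.isLowerCase(cpB);
    }

    static String key(int cpA, int cpB) {
        return new StringBuilder().appendCodePoint(cpA).appendCodePoint(cpB).toString();
    }

    // Better of the direct value and the penalized lowercase twin, else the floor.
    static double score(int cpA, int cpB) {
        Double direct = LOG_PROBS.get(key(cpA, cpB));
        double best = (direct == null) ? FLOOR : direct;
        if (foldGate(cpA, cpB)) {
            // Fold only the uppercase side(s); sentinels, digits and glue pass through.
            int lcA = Character.isUpperCase(cpA) ? Character.toLowerCase(cpA) : cpA;
            int lcB = Character.isUpperCase(cpB) ? Character.toLowerCase(cpB) : cpB;
            Double fold = LOG_PROBS.get(key(lcA, lcB));
            if (fold != null) {
                best = Math.max(best, fold - CASE_FOLD_PENALTY);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(score('\u041C', '\u0423')); // all-caps pair with a seen lowercase twin
        System.out.println(score('t', 'H'));           // mixed case: gate is closed
        System.out.println(score('^', '\u041C'));      // edge pair: gate open, twin unseen here
    }
}
```

In this toy table the all-caps pair is lifted from its direct value to just below its lowercase twin, while the mixed-case pair and the edge pair with no seen twin stay at the assumed floor.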
> Small improvements to lang detection, charset detection and junk detection
> --------------------------------------------------------------------------
>
> Key: TIKA-4745
> URL: https://issues.apache.org/jira/browse/TIKA-4745
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
>
> I ran a regression test in prep for the 4.0.0-beta-1 release. There are a
> number of smallish things that we can clean up in the components listed in
> the title.