tballison commented on code in PR #2872:
URL: https://github.com/apache/tika/pull/2872#discussion_r3363772214


##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/JunkDetector.java:
##########
@@ -973,15 +980,70 @@ public static int packBigramKey(int idxA, int idxB) {
      * once when scanning the text (avoiding a redundant binary search per
      * codepoint).
      */
+    /** Small per-bigram log-prob penalty subtracted from the case-folded
+     *  (lowercase) value when scoring an uppercase pair.  All-caps is a genuinely
+     *  weaker/rarer signal than lowercase, so it should score a hair BELOW its
+     *  lowercase form, not equal to it — and the margin guards the edge case where
+     *  an all-caps *mojibake* decode whose lowercase twin happens to be a seen
+     *  bigram would otherwise score like real lowercase text.  Kept small (0.25):
+     *  the lowercase/junk margin is ~0.8 logit, and δ=0.5 thinned it to ~0.1, so
+     *  0.25 leaves all-caps clearly clean (~0.5 above junk) while honoring the
+     *  "somewhat less languagey" principle. */
+    private static final double CASE_FOLD_PENALTY = 0.25;
+
     private static double scorePairF1(int cpA, int idxA, int cpB, int idxB,
                                          BigramTables tables) {
+        double direct = Double.NaN;
         if (idxA >= 0 && idxB >= 0) {
             int slot = lookupBigramSlot(tables, idxA, idxB);
             if (slot >= 0) {
-                return dequantize(tables.bigramValues[slot],
+                direct = dequantize(tables.bigramValues[slot],
                         tables.bigramQuantMin, tables.bigramQuantMax);
             }
         }
+        // Case-folded backoff: an ALL-UPPERCASE pair that is the case variant of
+        // a SEEN lowercase pair is real text wearing a different case (all-caps
+        // headings / emphasis, e.g. Greek "ΚΑΤΑΛΟΓΟΣ", Russian "МУЗЕЙ"), NOT junk.
+        // Score it as the BETTER of its own log-prob and its lowercase twin's —
+        // i.e. max(direct, fold).  max (not fold-only-on-miss) is essential: real
+        // all-caps bigrams ARE present in training (from headings) but rare, so the
+        // direct lookup hits a low value (МУ −12.4 vs lowercase му −6.7) and would
+        // otherwise bypass the fold and floor.  This is the discriminator raw
+        // probability cannot be: all-caps real text and all-caps mojibake are both
+        // improbable, but only real text has a SEEN lowercase twin.  Gated on BOTH
+        // codepoints being uppercase (case-CONSISTENT) so alternating-case junk
+        // ("tHiS") stays unfolded and floors; and only the lowercase twin's value
+        // is borrowed when that pair is actually seen, so all-caps mojibake
+        // (lowercase form also unseen) floors.
+        // Gate = "at least one uppercase letter AND no LOWERCASE letter" — so it
+        // folds both an interior all-caps pair (МУ) AND an edge pair where the other
+        // side is a sentinel or glue (^М, Й$, "М."), but NOT a mixed-case pair (the
+        // lowercase letter in "aB"/"tHiS" trips the gate, so case-inconsistent junk
+        // still floors).  Each uppercase letter is folded; sentinels/digits/glue
+        // pass through unchanged.  Folding the edges too is what fully rescues short
+        // all-caps headings, whose ^X/X$ bigrams would otherwise floor on the rare
+        // uppercase-letter unigram backoff.
+        if ((Character.isUpperCase(cpA) || Character.isUpperCase(cpB))
+                && !Character.isLowerCase(cpA) && !Character.isLowerCase(cpB)) {
+            int lcA = Character.isUpperCase(cpA) ? Character.toLowerCase(cpA) : cpA;
+            int lcB = Character.isUpperCase(cpB) ? Character.toLowerCase(cpB) : cpB;

Review Comment:
   Couldn't replicate this, and other agents disagree.
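
   For anyone trying to reproduce the behavior in isolation, the gate and the max(direct, fold) rule described in the hunk's comments can be sketched roughly as below. This is only a standalone approximation, not the PR's code: the hunk is cut off before the folded lookup, so the `Map`-based table, the `FLOOR` value, the class and helper names, and the exact point where `CASE_FOLD_PENALTY` is subtracted are all assumptions made for illustration. Only the gate expression, the 0.25 penalty, and the example log-probs (-12.4 and -6.7) come from the diff.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical standalone sketch of the case-folded backoff; not JunkDetector itself. */
public class CaseFoldSketch {
    // Value taken from the diff.
    static final double CASE_FOLD_PENALTY = 0.25;
    // Assumed stand-in for whatever the real code returns for an unseen bigram.
    static final double FLOOR = -20.0;

    /** Gate from the diff: at least one uppercase letter and no lowercase letter. */
    static boolean shouldFold(int cpA, int cpB) {
        return (Character.isUpperCase(cpA) || Character.isUpperCase(cpB))
                && !Character.isLowerCase(cpA) && !Character.isLowerCase(cpB);
    }

    /** Folds uppercase letters; sentinels, digits and glue pass through unchanged. */
    static int fold(int cp) {
        return Character.isUpperCase(cp) ? Character.toLowerCase(cp) : cp;
    }

    /** Hypothetical table key: the two codepoints as a string. */
    static String key(int cpA, int cpB) {
        return new StringBuilder().appendCodePoint(cpA).appendCodePoint(cpB).toString();
    }

    /** Better of the direct log-prob and the penalized lowercase twin, else FLOOR. */
    static double score(Map<String, Double> table, int cpA, int cpB) {
        Double direct = table.get(key(cpA, cpB));
        double best = direct != null ? direct : Double.NaN;
        if (shouldFold(cpA, cpB)) {
            Double folded = table.get(key(fold(cpA), fold(cpB)));
            if (folded != null) {
                // Assumption: the penalty is applied to the folded value before the max.
                double candidate = folded - CASE_FOLD_PENALTY;
                best = Double.isNaN(best) ? candidate : Math.max(best, candidate);
            }
        }
        return Double.isNaN(best) ? FLOOR : best;
    }

    public static void main(String[] args) {
        Map<String, Double> table = new HashMap<>();
        // Cyrillic "mu" in lowercase and uppercase, written as escapes to stay ASCII-safe.
        table.put("\u043C\u0443", -6.7);
        table.put("\u041C\u0423", -12.4);
        // Uppercase pair with a seen lowercase twin: the folded value wins the max.
        System.out.println(score(table, 0x041C, 0x0423));
        // Mixed-case pair: the lowercase letter blocks the gate, so it floors.
        System.out.println(score(table, 't', 'H'));
        // Uppercase pair whose lowercase twin is unseen: floors as well.
        System.out.println(score(table, 'Q', 'Z'));
    }
}
```

   Running the sketch against a specific failing input would make it easier to say whether a disagreement comes from the gate itself or from something outside this hunk.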



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
