[jira] [Created] (PDFBOX-5600) applyGsubFeature() doesn't use the longest possible replacement

Tilman Hausherr (Jira) Thu, 11 May 2023 11:14:11 -0700

Tilman Hausherr created PDFBOX-5600:
---------------------------------------


             Summary: applyGsubFeature() doesn't use the longest possible 
replacement
                 Key: PDFBOX-5600
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5600
             Project: PDFBox
          Issue Type: Sub-task
          Components: FontBox
    Affects Versions: 3.0.0 PDFBox
            Reporter: Tilman Hausherr
            Assignee: Tilman Hausherr
             Fix For: 3.0.0 PDFBox


While working on Latin ligatures I noticed that in words like "affluent" only 
"ff" was caught but not "ffl".

CompoundCharacterTokenizer calls getRegexFromTokens which returns Strings like 
(_79_99_)|(_80_99_)|(_92_99_) and makes a regexp out of that.

tokenize finds its match with find(), which takes the first alternative that 
matches, not necessarily the longest.
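
The effect can be reproduced with java.util.regex alone. The following is a 
minimal sketch, not the FontBox code: the class and method names are made up 
for illustration, and the glyph ids (73 for "f", 79 for "l", 68 and 88 for the 
neighbouring letters) are assumed values. It shows that the order of the 
alternatives decides whether "ff" or "ffl" is matched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of FontBox.
class AlternationOrderDemo
{
    // Collects every substring that find() matches, in order of occurrence.
    static List<String> matches(String regex, String text)
    {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find())
        {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args)
    {
        // Assumed glyph id sequence for "...ffl..." in the "_id_id_" form.
        String text = "_68_73_73_79_88_";

        // "ff" listed before "ffl": the shorter alternative wins.
        System.out.println(matches("(_73_73_)|(_73_73_79_)", text)); // prints [_73_73_]

        // "ffl" listed before "ff": the longer alternative wins.
        System.out.println(matches("(_73_73_79_)|(_73_73_)", text)); // prints [_73_73_79_]
    }
}
```

Java tries the alternatives from left to right at each position and stops at 
the first one that succeeds, so a token that is a prefix of a longer token 
hides the longer one whenever it comes first.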

Thus the set that getRegexFromTokens turns into a regexp for 
CompoundCharacterTokenizer should be sorted by descending length, so that 
longer tokens come first. I'm solving this with a custom TreeSet in 
getMatchersAsStrings.
This will of course make everything slower; in the long run, maybe we should 
rewrite the code so that it doesn't use the regexp logic (although it's a smart 
idea), but only after we have more real-world test coverage.
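
A rough sketch of the sorting idea, under stated assumptions: the class, the 
comparator and the two helper methods below are hypothetical and do not 
reproduce the actual getMatchersAsStrings or getRegexFromTokens signatures, and 
the glyph ids are invented. The comparator needs a tie-breaker, because a 
TreeSet treats elements that compare as equal as duplicates and would otherwise 
drop all but one token of each length.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical demo class, not part of FontBox.
class LongestFirstTokens
{
    // Longer tokens first; equal lengths fall back to natural String order
    // so that distinct tokens of the same length are all kept by the TreeSet.
    static final Comparator<String> LONGEST_FIRST = (a, b) ->
    {
        int byLength = Integer.compare(b.length(), a.length());
        return byLength != 0 ? byLength : a.compareTo(b);
    };

    static Set<String> sortLongestFirst(List<String> tokens)
    {
        Set<String> sorted = new TreeSet<>(LONGEST_FIRST);
        sorted.addAll(tokens);
        return sorted;
    }

    // Builds an alternation such as (_a_)|(_b_) in the set's iteration order.
    static String toRegex(Set<String> tokens)
    {
        return tokens.stream()
                     .map(token -> "(" + token + ")")
                     .collect(Collectors.joining("|"));
    }

    public static void main(String[] args)
    {
        // Assumed tokens for "ff", "fl" and "ffl" (73 = f, 79 = l).
        List<String> tokens = Arrays.asList("_73_73_", "_73_79_", "_73_73_79_");
        String regex = toRegex(sortLongestFirst(tokens));
        System.out.println(regex); // prints (_73_73_79_)|(_73_73_)|(_73_79_)

        Matcher matcher = Pattern.compile(regex).matcher("_68_73_73_79_88_");
        if (matcher.find())
        {
            System.out.println(matcher.group()); // prints _73_73_79_
        }
    }
}
```

Sorting on insertion costs O(log n) per token compared with a plain hash set, 
which is one source of the slowdown mentioned above.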



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
