Tilman Hausherr created PDFBOX-5600:
---------------------------------------

             Summary: applyGsubFeature() doesn't use the longest possible replacement
                 Key: PDFBOX-5600
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5600
             Project: PDFBox
          Issue Type: Sub-task
          Components: FontBox
    Affects Versions: 3.0.0 PDFBox
            Reporter: Tilman Hausherr
            Assignee: Tilman Hausherr
             Fix For: 3.0.0 PDFBox

While working on Latin ligatures I noticed that in words like "affluent" only "ff" was caught but not "ffl".

CompoundCharacterTokenizer calls getRegexFromTokens, which returns Strings like

(_79_99_)|(_80_99_)|(_92_99_)

and makes a regexp out of that. tokenize finds its match with find(), but not necessarily the longest one, because regexp alternation takes the first alternative that matches at a given position. Thus getRegexFromTokens should sort the set used by CompoundCharacterTokenizer by reverse length, so that longer tokens are tried before their shorter prefixes. I'm solving this with a custom TreeSet in getMatchersAsStrings.

This will of course make everything slower; in the long run, maybe we should rewrite the code so that it doesn't use the regexp logic (although it's a smart idea), but only after we have more real-world test coverage.
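
The effect described above can be reproduced outside of FontBox. What follows is only a minimal sketch, not the actual PDFBox code: the class LongestMatchSketch, its helpers toRegex and firstMatch, the comparator LONGEST_FIRST and the glyph IDs 68, 73, 79 and 88 are all made up for illustration. It shows that an alternation built in arbitrary order stops at the shorter "ff"-style token, while the same tokens held in a TreeSet ordered by descending length yield the longer "ffl"-style token.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical demo class; not part of PDFBox or FontBox.
class LongestMatchSketch {

    // Longer tokens first; ties are broken alphabetically so that the
    // TreeSet does not drop distinct tokens of equal length.
    static final Comparator<String> LONGEST_FIRST = (a, b) -> {
        int byLength = Integer.compare(b.length(), a.length());
        return byLength != 0 ? byLength : a.compareTo(b);
    };

    // Builds an alternation such as (_73_73_)|(_73_73_79_) in iteration order.
    static String toRegex(Collection<String> tokens) {
        return tokens.stream()
                .map(t -> "(" + t + ")")
                .collect(Collectors.joining("|"));
    }

    // Returns the first match that find() reports, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Assumed glyph IDs: 73 stands for "f", 79 for "l".
        List<String> unsorted = Arrays.asList("_73_73_", "_73_73_79_");
        Set<String> sorted = new TreeSet<>(LONGEST_FIRST);
        sorted.addAll(unsorted);

        String text = "_68_73_73_79_88_";

        // The first alternative wins, so only the "ff" token is found.
        System.out.println(firstMatch(toRegex(unsorted), text));
        // With the longest token first, the "ffl" token is found.
        System.out.println(firstMatch(toRegex(sorted), text));
    }
}
```

The tie-breaker in the comparator matters: a TreeSet treats elements that compare as equal as duplicates, so a comparator that looked at length alone would silently discard all but one token of each length.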