[ https://issues.apache.org/jira/browse/PDFBOX-5600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17721864#comment-17721864 ]
ASF subversion and git services commented on PDFBOX-5600:
---------------------------------------------------------

Commit 1909754 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1909754 ]

PDFBOX-5600: larger string should appear first, to ensure that "ffl" is used as a replacement and not just "ff" when possible when doing ligatures

> applyGsubFeature() doesn't use the longest possible replacement
> ---------------------------------------------------------------
>
>                 Key: PDFBOX-5600
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5600
>             Project: PDFBox
>          Issue Type: Sub-task
>          Components: FontBox
>    Affects Versions: 3.0.0 PDFBox
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>            Priority: Major
>              Labels: gsub
>             Fix For: 3.0.0 PDFBox
>
>
> While working on Latin ligatures I noticed that in words like "affluent" only
> "ff" was caught but not "ffl".
> CompoundCharacterTokenizer calls getRegexFromTokens, which returns Strings
> like (_79_99_)|(_80_99_)|(_92_99_) and makes a regexp out of that.
> tokenize finds its match with find(), but not necessarily the longest one.
> Thus getRegexFromTokens should sort the set that is used by
> CompoundCharacterTokenizer by reverse length. I'm solving this with a custom
> TreeSet in getMatchersAsStrings.
> This will of course make everything slower; in the long run, maybe we should
> rewrite the code so that it doesn't use the regexp logic (although it's a
> smart idea), but only after we have more real-world test coverage.
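
For illustration, the effect described in the issue can be reproduced outside PDFBox. The sketch below is not the FontBox code: the class LongestFirstDemo, its helper methods, and the use of plain letters instead of glyph-ID strings such as _79_99_ are all assumptions made for the example. It relies on the fact that java.util.regex tries alternatives from left to right, so find() returns "ff" when the shorter alternative is listed first, and "ffl" once the tokens are held in a TreeSet ordered by descending length.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical standalone demo, not part of PDFBox/FontBox.
public class LongestFirstDemo
{
    // Longer strings sort first. Ties are broken by natural order so that
    // distinct tokens of equal length are not treated as duplicates by TreeSet.
    static final Comparator<String> LONGEST_FIRST = (a, b) ->
    {
        int byLength = Integer.compare(b.length(), a.length());
        return byLength != 0 ? byLength : a.compareTo(b);
    };

    // Builds an alternation such as (\Qffl\E)|(\Qff\E) in iteration order.
    static String buildRegex(Collection<String> tokens)
    {
        return tokens.stream()
                .map(t -> "(" + Pattern.quote(t) + ")")
                .collect(Collectors.joining("|"));
    }

    // Returns the first match reported by find(), or null if there is none.
    static String firstMatch(Collection<String> tokens, String text)
    {
        Matcher m = Pattern.compile(buildRegex(tokens)).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args)
    {
        List<String> unsorted = Arrays.asList("ff", "ffl");

        Set<String> sorted = new TreeSet<>(LONGEST_FIRST);
        sorted.addAll(unsorted);

        System.out.println(firstMatch(unsorted, "affluent")); // ff
        System.out.println(firstMatch(sorted, "affluent"));   // ffl
    }
}
```

The tie-breaker in the comparator matters: a TreeSet that compared by length alone would silently drop all but one token of each length.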