[ https://issues.apache.org/jira/browse/PDFBOX-5600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr resolved PDFBOX-5600.
-------------------------------------
    Resolution: Fixed

> applyGsubFeature() doesn't use the longest possible replacement
> ---------------------------------------------------------------
>
>                 Key: PDFBOX-5600
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5600
>             Project: PDFBox
>          Issue Type: Sub-task
>          Components: FontBox
>    Affects Versions: 3.0.0 PDFBox
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>            Priority: Major
>              Labels: gsub
>             Fix For: 3.0.0 PDFBox
>
>
> While working on Latin ligatures I noticed that in words like "affluent" only
> "ff" was caught but not "ffl".
> CompoundCharacterTokenizer calls getRegexFromTokens, which returns strings like
> {noformat}
> (_79_99_)|(_80_99_)|(_92_99_)
> {noformat}
> and makes a regexp out of that.
> tokenize finds its match with find(), but the match is not necessarily the
> longest one: the regexp tries the alternatives in the order they are listed.
> Thus getRegexFromTokens should sort by reverse length the set that is used by
> CompoundCharacterTokenizer. I'm solving this with a custom TreeSet in
> getMatchersAsStrings.
> This will of course make everything slower; in the long run, maybe we should
> rewrite the code so that it doesn't use the regexp logic (although it's a
> smart idea!), but only after we have more real-world test coverage.
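
The effect described in the issue can be reproduced outside of FontBox. The following is a minimal sketch, not the actual PDFBox code: the class LongestAlternativeDemo and its helpers toRegex, firstMatch and longestFirst are hypothetical names, and plain letters stand in for the underscore-delimited glyph id strings. It only illustrates that find() on an alternation returns the first listed alternative that matches, and that ordering the alternatives longest-first changes the result for "affluent".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of PDFBox or FontBox.
class LongestAlternativeDemo
{
    // Builds "(t1)|(t2)|..." from the tokens, keeping the given order.
    static String toRegex(List<String> tokens)
    {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens)
        {
            if (sb.length() > 0)
            {
                sb.append('|');
            }
            sb.append('(').append(Pattern.quote(token)).append(')');
        }
        return sb.toString();
    }

    // Returns the first match that find() reports, or null if there is none.
    static String firstMatch(List<String> tokens, String text)
    {
        Matcher matcher = Pattern.compile(toRegex(tokens)).matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    // Returns a copy of the tokens sorted by descending length
    // (ties broken alphabetically so the order is deterministic).
    static List<String> longestFirst(List<String> tokens)
    {
        List<String> sorted = new ArrayList<>(tokens);
        sorted.sort((a, b) ->
        {
            int byLength = Integer.compare(b.length(), a.length());
            return byLength != 0 ? byLength : a.compareTo(b);
        });
        return sorted;
    }

    public static void main(String[] args)
    {
        List<String> tokens = Arrays.asList("ff", "fl", "ffl");

        // Unsorted: "ff" is listed before "ffl", so it wins.
        System.out.println(firstMatch(tokens, "affluent"));               // prints "ff"

        // Longest first: "ffl" is tried before "ff".
        System.out.println(firstMatch(longestFirst(tokens), "affluent")); // prints "ffl"
    }
}
```

A sorted copy is used here for brevity; the fix described in the issue keeps the order inside a TreeSet with a custom comparator instead, which gives the same longest-first iteration order.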