Tilman Hausherr created PDFBOX-5600:
---------------------------------------
Summary: applyGsubFeature() doesn't use the longest possible replacement
Key: PDFBOX-5600
URL: https://issues.apache.org/jira/browse/PDFBOX-5600
Project: PDFBox
Issue Type: Sub-task
Components: FontBox
Affects Versions: 3.0.0 PDFBox
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
Fix For: 3.0.0 PDFBox
While working on Latin ligatures I noticed that in words like "affluent" only
the "ff" ligature was caught, but not "ffl".
CompoundCharacterTokenizer calls getRegexFromTokens, which returns strings like
(_79_99_)|(_80_99_)|(_92_99_), and makes a regexp out of that.
tokenize finds its match with find(), but not necessarily the longest one: regexp
alternation tries the alternatives from left to right and takes the first that
matches.
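
The behaviour can be reproduced with plain java.util.regex, independent of FontBox. This is only a minimal sketch: the class name, the helper firstMatch and the glyph ids (73 for "f", 79 for "l") are made up for illustration and are not taken from the PDFBox code or from a real font.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of FontBox.
class AlternationOrderDemo {

    // Returns the first substring that find() reports, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Assumed encoding: glyph ids separated by underscores.
        String text = "_68_73_73_79_88_";

        // Shorter token listed first: the "ff" alternative wins.
        System.out.println(firstMatch("(_73_73_)|(_73_73_79_)", text)); // prints _73_73_

        // Longer token listed first: the "ffl" alternative wins.
        System.out.println(firstMatch("(_73_73_79_)|(_73_73_)", text)); // prints _73_73_79_
    }
}
```

Both patterns contain the same alternatives; only their order differs, which is why the order of the token set matters.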
Thus the set that getRegexFromTokens works on for CompoundCharacterTokenizer
should be sorted by descending length, so that longer tokens come first in the
regexp. I'm solving this with a custom TreeSet in getMatchersAsStrings.
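
A rough sketch of the idea, not the actual patch: a TreeSet with a comparator that orders by descending length and breaks ties alphabetically, so that tokens of equal length are not dropped as duplicates. The class name, the comparator LONGEST_FIRST, the method toRegex and the glyph ids are assumptions for illustration only.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch class, not part of FontBox.
class LongestFirstRegex {

    // Longer strings first; equal lengths fall back to natural order so that
    // the TreeSet keeps distinct tokens of the same length.
    static final Comparator<String> LONGEST_FIRST = (a, b) -> {
        int byLength = Integer.compare(b.length(), a.length());
        return byLength != 0 ? byLength : a.compareTo(b);
    };

    // Builds an alternation such as (_1_2_3_)|(_1_2_) from the tokens.
    // The tokens are assumed to contain only digits and underscores,
    // so no regexp quoting is done here.
    static String toRegex(Collection<String> tokens) {
        Set<String> sorted = new TreeSet<>(LONGEST_FIRST);
        sorted.addAll(tokens);
        StringBuilder sb = new StringBuilder();
        for (String token : sorted) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append('(').append(token).append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toRegex(Arrays.asList("_73_73_", "_73_73_79_", "_73_76_")));
        // prints (_73_73_79_)|(_73_73_)|(_73_76_)
    }
}
```

Without the tie-break, a comparator that only looks at the length would treat two different tokens of the same length as equal, and the TreeSet would silently keep just one of them.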
This will of course make everything slower; in the long run, maybe we should
rewrite the code so that it doesn't use the regexp logic (although it's a smart
idea), but only after we have more real-world test coverage.