[
https://issues.apache.org/jira/browse/TIKA-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
tom hill updated TIKA-3858:
---------------------------
Description:
It appears that the issue in TIKA-1289 is still present. Ligatures get replaced
by a question mark.
As a particular example, the ft ligature is replaced by the UTF-8 byte
sequence ef bf bd (U+FFFD, the Unicode replacement character).
Is there any new resolution on this issue? Just returning the ft ligature would
be great, or normalizing it to f, t.
This particular example comes from saving my Gmail inbox page as a PDF in
Chrome. It uses the ft ligature in the word "Drafts".
There are many similar examples; it's not specific to one PDF generator.
was:
It appears that the issue in TIKA-1289 is still present. Ligatures get replaced
by a question mark.
As a particular example, the ft ligature is getting replaced by utf-8: ef bf bd
Is there any new resolution on this issue? Just returning the fl ligature would
be great, or normalizing it to f, t.
> Ligatures convert on text extraction
> -------------------------------------
>
> Key: TIKA-3858
> URL: https://issues.apache.org/jira/browse/TIKA-3858
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.5
> Environment: win 8, jre 1.5
> Reporter: tom hill
> Priority: Major
>
> It appears that the issue in TIKA-1289 is still present. Ligatures get
> replaced by a question mark.
> As a particular example, the ft ligature is replaced by the UTF-8 byte
> sequence ef bf bd (U+FFFD, the Unicode replacement character).
> Is there any new resolution on this issue? Just returning the ft ligature
> would be great, or normalizing it to f, t.
> This particular example comes from saving my Gmail inbox page as a PDF in
> Chrome. It uses the ft ligature in the word "Drafts".
> There are many similar examples; it's not specific to one PDF generator.
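
For anyone triaging this, a minimal sketch of the two behaviours the report
mentions may help: it checks that the bytes ef bf bd decode to U+FFFD, and it
shows what "normalizing to plain letters" looks like for a ligature that does
have a Unicode code point (U+FB02, fl). This is not Tika or PDFBox code. The
helper names normalize_ligatures and count_unmapped are hypothetical, and using
NFKC for the normalization step is an assumption rather than the project's
approach. As far as I know, Unicode has no dedicated code point for an ft
ligature, which may be part of why the glyph ends up unmapped.

```python
import unicodedata

# U+FFFD, the Unicode replacement character; its UTF-8 encoding is ef bf bd.
REPLACEMENT_CHAR = "\ufffd"


def normalize_ligatures(text: str) -> str:
    """Expand compatibility ligatures (e.g. U+FB02) into plain letters.

    Assumption: NFKC is acceptable here. It also rewrites other
    compatibility characters, such as full-width forms and superscripts.
    """
    return unicodedata.normalize("NFKC", text)


def count_unmapped(text: str) -> int:
    """Count U+FFFD characters, i.e. glyphs the extractor could not map."""
    return text.count(REPLACEMENT_CHAR)


if __name__ == "__main__":
    # The bytes quoted in the report decode to the replacement character.
    print(bytes.fromhex("efbfbd").decode("utf-8") == REPLACEMENT_CHAR)
    # A ligature with a real code point can be expanded after extraction.
    print(normalize_ligatures("\ufb02ow"))
    # A glyph already replaced by U+FFFD can only be detected, not restored.
    print(count_unmapped("Dra\ufffds"))
```

Note that normalization only helps when the extractor returns a real ligature
code point. Once the text contains U+FFFD, the original glyph identity is gone,
so any fix for the "Drafts" case would have to happen during extraction.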