[jira] [Comment Edited] (TIKA-3005) Unintelligible text content from PDF file

Tilman Hausherr (Jira) Fri, 06 Dec 2019 11:54:48 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990125#comment-16990125
 ]


Tilman Hausherr edited comment on TIKA-3005 at 12/6/19 7:53 PM:
----------------------------------------------------------------

Yeah, "identity" is incorrect here; it is just a bad map. But there are other 
files where the assumption is correct, so we get a good extraction. With this 
file, PDFBox doesn't see this as a "bad extraction".

What you could do is parse the ToUnicode stream yourself and then make an 
assumption that is different from the voodoo we do in PDFont.loadUnicodeCmap(). 
Of course, Tika would then be slower because the ToUnicode stream would be 
parsed twice.
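
As a rough illustration of "parse it yourself", here is a minimal sketch in 
plain Java. It is not PDFBox's CMap parser and not what Tika does: the class 
ToUnicodeSketch and its methods are hypothetical names, it assumes the 
ToUnicode stream has already been decoded to text, and it only understands 
simple "beginbfchar" sections (no bfrange, no codespace ranges). The 
"identity fraction" is just one possible signal a caller could base its own 
assumption on.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox or Tika.
final class ToUnicodeSketch {

    // One "beginbfchar ... endbfchar" section of an already decoded CMap.
    private static final Pattern SECTION =
            Pattern.compile("beginbfchar(.*?)endbfchar", Pattern.DOTALL);

    // One "<srcCode> <dstUtf16BE>" pair inside such a section.
    private static final Pattern PAIR =
            Pattern.compile("<([0-9A-Fa-f]{1,8})>\\s*<([0-9A-Fa-f]*)>");

    // Collects character code -> Unicode string for all bfchar entries.
    public static Map<Integer, String> parseBfChars(String cmapText) {
        Map<Integer, String> map = new LinkedHashMap<>();
        Matcher section = SECTION.matcher(cmapText);
        while (section.find()) {
            Matcher pair = PAIR.matcher(section.group(1));
            while (pair.find()) {
                int code = Integer.parseUnsignedInt(pair.group(1), 16);
                map.put(code, decodeUtf16Be(pair.group(2)));
            }
        }
        return map;
    }

    // Decodes hex digits as UTF-16BE code units; trailing digits that do
    // not form a full 4-digit unit are ignored.
    public static String decodeUtf16Be(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            sb.append((char) Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return sb.toString();
    }

    // Fraction of entries that map a code to the character with the same
    // numeric value. What threshold (if any) to act on is left to the caller.
    public static double identityFraction(Map<Integer, String> map) {
        if (map.isEmpty()) {
            return 0.0;
        }
        int same = 0;
        for (Map.Entry<Integer, String> e : map.entrySet()) {
            String value = e.getValue();
            if (value.length() == 1 && value.charAt(0) == e.getKey().intValue()) {
                same++;
            }
        }
        return (double) same / map.size();
    }
}
```

A caller would pass the decoded stream text to parseBfChars() and then decide 
for itself, based on the resulting map, whether to trust it. Parsing the 
stream this way is in addition to whatever PDFBox does internally, which is 
where the extra cost mentioned above comes from.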


was (Author: tilman):
Yeah, "identity" is incorrect here; it is just a bad map. But there are other 
files where the assumption is correct, so we get a good extraction. With this 
file, PDFBox doesn't see this as a "bad extraction".

What you could do is parse the ToUnicode stream yourself and then make an 
assumption that is different from the voodoo we do in PDFont.loadUnicodeCmap(). 
Of course, Tika would then be slower because the cmap would be parsed twice.

> Unintelligible text content from PDF file
> -----------------------------------------
>
>                 Key: TIKA-3005
>                 URL: https://issues.apache.org/jira/browse/TIKA-3005
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.22
>            Reporter: Jorge Spinsanti
>            Priority: Major
>         Attachments: file1.pdf, file2.pdf, file3.pdf, resume_4.pdf
>
>
> If I get text content from attachment, Tika doesn't fail but the content is 
> unintelligible



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
